Abstract

While machine learning has proven to be a powerful data-driven solution to many 
real-life problems, its use in sensitive domains has been limited due to privacy concerns. 

A popular approach known as differential privacy offers provable privacy guarantees, 
but it is often observed in practice that it could substantially hamper learning accuracy. 

In this paper we study the learnability (whether a problem can be learned by any 
algorithm) under Vapnik’s general learning setting with differential privacy constraint, 
and reveal some intricate relationships between privacy, stability and learnability. 

In particular, we show that a problem is privately learnable if and only if there
is a private algorithm that asymptotically minimizes the empirical risk (AERM). In 
contrast, for non-private learning AERM alone is not sufficient for learnability. This 
result suggests that when searching for private learning algorithms, we can restrict 
the search to algorithms that are AERM. In light of this, we propose a conceptual 
procedure that always finds a universally consistent algorithm whenever the problem is 
learnable under privacy constraint. We also propose a generic and practical algorithm 
and show that under very general conditions it privately learns a wide class of learning 
problems. Lastly, we extend some of the results to the more practical $(\epsilon, \delta)$-differential
privacy and establish the existence of a phase transition on the class of problems that
are approximately privately learnable with respect to how small $\delta$ needs to be.

Keywords: differential privacy, learnability, characterization, stability, privacy-preserving 
machine learning 
1 Introduction


Increasing public concerns regarding data privacy have posed obstacles in the development 
and application of new machine learning methods as data collectors and curators may no 
longer be able to share data for research purposes. In addition to addressing the original goal 
of information extraction, privacy-preserving learning also requires the learning procedure 
to protect sensitive information of individual data entries. For example, the second Netflix 
Prize competition was canceled in response to a lawsuit and Federal Trade Commission 
privacy concerns, and the National Institute of Health decided in August 2008 to remove 
aggregate Genome-Wide Association Studies (GWAS) data from the public web site, after 
learning about a potential privacy risk. 

A major challenge in developing privacy-preserving learning methods is to quantify formally 
the amount of privacy leakage, given all possible and unknown auxiliary information the 
attacker may have, a challenge in part addressed by the notion of differential privacy 
(Dwork, 2006; Dwork et al., 2006b). Differential privacy has three main advantages over 
other approaches: (1) it rigorously quantifies the privacy property of any data analysis 
mechanism; (2) it controls the amount of privacy leakage regardless of the attacker’s resources
or knowledge; and (3) it has useful interpretations from the perspectives of Bayesian inference
and statistical hypothesis testing, and hence fits naturally in the general framework of
statistical machine learning (see, e.g., Dwork & Lei, 2009; Wasserman & Zhou, 2010; Smith,
2011; Lei, 2011; Wang et al., 2015), as well as applications involving regression (Chaudhuri
et al., 2011; Thakurta & Smith, 2013) and GWAS data (Yu et al., 2014).

In this paper we focus on the following fundamental question about differential privacy and 
machine learning: What problems can we learn with differential privacy? Most literature 
focuses on designing differentially private extensions of various learning algorithms, where 
the methods depend crucially on the specific context and differ vastly in nature. But 
with the privacy constraint, we have less choice in developing learning and data analysis 
algorithms. It remains unclear how such a constraint affects our ability to learn, and if it is 
possible to design a generic privacy-preserving analysis mechanism that is applicable to a 
wide class of learning problems. 


Our Contributions We provide a general answer to the relationship between learnability 
and differential privacy under Vapnik’s General Learning Setting (Vapnik, 1995) in four 
aspects. 

1. We characterize the subset of problems in the General Learning Setting that can be 
learned under differential privacy. Specifically, we show that a sufficient and necessary 
condition for a problem to be privately learnable is the existence of an algorithm that is 
differentially private and asymptotically minimizes the empirical risk. This characterization 
generalizes previous studies of the subject (Kasiviswanathan et al., 2011; Beimel et al.,
2013a) that focus on binary classification in discrete domain under the PAC learning model. 
Technically, the result relies on the now well-known intuitive observation that “privacy 
implies algorithmic stability” and the argument in Shalev-Shwartz et al. (2010) that shows 
a variant of algorithmic stability is necessary for learnability. 

2. We also introduce a weaker notion of learnability, which only requires consistency for
a class of distributions $\mathcal{D}$. Problems that are not privately learnable (a surprisingly large
class that includes simple problems such as 0-1 loss binary classification in a continuous
feature domain (Chaudhuri & Hsu, 2011)) are usually privately $\mathcal{D}$-learnable for some “nice”
distribution class $\mathcal{D}$. We characterize the subset of privately $\mathcal{D}$-learnable problems that are
also (non-privately) learnable using conditions analogous to those in distribution-free private
learning.

3. Inspired by the equivalence between private learnability and private AERM, we propose
a generic (but impractical) procedure that always finds a consistent and private algorithm
for any privately learnable (or $\mathcal{D}$-learnable) problem. We also study a specific algorithm
that aims at minimizing the empirical risk while preserving privacy. We show that
under a sufficient condition that relies on the geometry of the hypothesis space and the
data distribution, this algorithm is able to privately learn (or $\mathcal{D}$-learn) a large range of
learning problems including classification, regression, clustering and density estimation,
and it is computationally efficient when the problem is convex. In fact, this generic learning
algorithm learns any privately learnable problem in the PAC learning setting (Beimel et al.,
2013a). It remains an open problem whether the second algorithm also learns any privately
learnable problem in the General Learning Setting.

4. Lastly, we provide a preliminary study of learnability under the more practical $(\epsilon, \delta)$-differential
privacy. Our results reveal that whether there is a separation between learnability
and approximate private learnability depends on how fast $\delta$ is required to go to 0 with
respect to the size of the data. Finding where the exact phase transition occurs is an open 
problem of future interest. 

Our primary objective is to understand the conceptual relationship between differential privacy
and learnability under a general framework, so the rates of convergence obtained in the
analysis may be suboptimal. Although we do provide some discussion on polynomial time
approximations to the proposed algorithm, learnability under computational constraints is 
beyond the scope of this paper. 


Related work While a large amount of work has been devoted to finding consistent (and
rate-optimal) differentially private learning algorithms in various settings (e.g., Chaudhuri
et al., 2011; Kifer et al., 2012; Jain & Thakurta, 2013; Bassily et al., 2014), the characterization
of privately learnable problems has only been studied in a few special cases.




Kasiviswanathan et al. (2011) showed that, for binary classification with a finite discrete
hypothesis space, anything that is non-privately learnable is privately learnable under the
agnostic Probably Approximately Correct (PAC) learning framework; therefore “finite
VC-dimension” characterizes the set of privately learnable problems in this setting. Beimel et al.
(2013a) extended Kasiviswanathan et al. (2011) by characterizing the sample complexity of
the same class of problems, but the result only applies to the realizable (non-agnostic) case.
Chaudhuri & Hsu (2011) provided a counterexample showing that for continuous hypothesis
and data spaces, there is a gap between learnability and learnability under a privacy
constraint. They proposed to fix this issue by either weakening the privacy requirement to
labels only or by restricting the class of potential distributions. While meaningful in some
cases, these approaches do not resolve the learnability problem in general.

A key difference of our work from Kasiviswanathan et al. (2011); Chaudhuri & Hsu (2011);
Beimel et al. (2013a) is that we consider a more general class of learning problems and 
provide a proper treatment in a statistical learning framework. This allows us to capture a 
wider collection of important learning problems (see Figure 1(a) and Table 1). 

It is important to note that despite its generality, Vapnik’s general learning setting still 
does not nearly cover the full spectrum of private learning. In particular, our results do not 
apply to improper learning (learning using a different hypothesis class) as considered in 
Beimel et al. (2013a) or to structural loss minimization (the loss function jointly takes all
data points as input) considered in Beimel et al. (2013b). Also, our results do not address 
the sample complexity problem, which remains open in the general learning setting even 
for learning without privacy constraints. 

Our characterization of private learnability (and private $\mathcal{D}$-learnability) in Section 3 uses
a recent advance in the characterization of general learnability given by Shalev-Shwartz
et al. (2010). Roughly speaking, they showed that a problem is learnable if and only if
there exists an algorithm that (i) is stable under small perturbation of training data, and
(ii) behaves like empirical risk minimization (ERM) asymptotically. We also make use of
a folklore observation that “Privacy $\Rightarrow$ Stability $\Rightarrow$ Generalization”. The connection between
privacy and stability appeared as early as 2008 in a conference version of Kasiviswanathan
et al. (2011). The further connection to “generalization” recently appeared in blog posts^,
was stated as a theorem in Appendix F of Bassily et al. (2014), and was shown to hold with
strong concentration in Dwork et al. (2015b).

Dwork et al. (2015b) is part of an independent line of work (Hardt & Ullman, 2014; Bassily
et al., 2015; Dwork et al., 2015a; Blum & Hardt, 2015) on adaptive data analysis, which also
stems from the observation that privacy implies stability and generalization. Compared
with the adaptive data analysis literature, our focus is quite different. Adaptive data analysis
focuses on the impact of $k$ on how fast the maximum absolute error of $k$ adaptively chosen
queries goes to 0 as a function of $n$, while this paper is concerned with whether the error
can go to 0 at all for each learning problem when we require the learning algorithm to be
differentially private with $\epsilon < \infty$. Nonetheless, we acknowledge that Theorem 7 in Dwork
et al. (2015b) provides an interesting alternative proof for “differentially private learners
have small generalization error”, when choosing the statistical query as evaluating a loss
function at a privately learned hypothesis. The connection is not quite obvious and we
provide a more detailed explanation in Appendix B.

^For instance, Frank McSherry described in a blog post an example of exploiting
differential privacy for measure concentration: http://windowsontheory.org/2014/02/04/differential-privacy-for-measure-concentration/; Moritz Hardt discussed the connection of
differential privacy to stability and generalization in his blog post: http://blog.mrtz.org/2014/01/13/false-discovery.

The main tool used in the construction of our generic private learning algorithm in Section 4
is the Exponential Mechanism (McSherry & Talwar, 2007), which provides a simple and
differentially private approximation to the maximizer of a score function among a candidate
set. In the general learning context, we use the negative empirical risk as the utility function,
and apply the exponential mechanism to a possibly pre-discretized hypothesis space. This
exponential mechanism approach was used in Bassily et al. (2014) for minimizing convex and
Lipschitz functions. The sample discretization procedure has been considered in Chaudhuri
& Hsu (2011) and Beimel et al. (2013a). Our scope and proof techniques are different.
Our strategy is to show that, under some general regularity conditions, the exponential 
mechanism is stable and behaves like ERM. Our sublevel set condition has the same flavor 
as that in the proof of Bassily et al. (2014, Theorem 3.2), although we do not need the loss 
function to be convex or Lipschitz. 

Stability, privacy and generalization were also studied in Thakurta & Smith (2013) with
different notions of stability. More importantly, their stability is used as an assumption
rather than a consequence, so their result is not directly comparable to ours.


2 Background 

2.1 Learnability under the General Learning Setting 

In the General Learning Setting of Vapnik (1995), a learning problem is characterized by a
triplet $(\mathcal{Z}, \mathcal{H}, \ell)$. Here $\mathcal{Z}$ is the sample space (with a $\sigma$-algebra). The hypothesis space $\mathcal{H}$ is
a collection of models such that each $h \in \mathcal{H}$ describes some structure of the data. The
loss function $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$ measures how well the hypothesis $h$ explains the data instance
$z \in \mathcal{Z}$. For example, in supervised learning problems $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the feature
space and $\mathcal{Y}$ is the label space; $\mathcal{H}$ defines a collection of mappings $h : \mathcal{X} \to \mathcal{Y}$; and $\ell(h, z)$
measures how well $h$ predicts the feature-label relationship $z = (x, y) \in \mathcal{Z}$. This setting
includes problems with continuous input/output in potentially infinite dimensional spaces
(e.g. RKHS methods), hence is much more general than PAC learning. In addition, the
general learning setting also covers a variety of unsupervised learning problems, including
clustering, density estimation, principal component analysis (PCA) and variants (e.g.,
Sparse PCA, Robust PCA), dictionary learning, matrix factorization and even Latent
Dirichlet Allocation (LDA). Details of these examples are given in Table 1 (the first few are
extracted from Shalev-Shwartz et al. (2010)).

Figure 1: The Big Picture: illustration of the general learning setting and our contribution in
understanding differentially private learnability. (a) Illustration of the general learning setting;
examples of known DP extensions are circled in maroon. (b) Our characterization of privately
learnable problems in the general learning setting (in blue).

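To account for the randomness in the data, we are primarily interested in the case where the
data $Z = \{z_1, \dots, z_n\} \in \mathcal{Z}^n$ are independent samples drawn from an unknown probability
distribution $\mathcal{D}$ on $\mathcal{Z}$. We denote such a random sample by $Z \sim \mathcal{D}^n$. For a given distribution
$\mathcal{D}$, let $R(h)$ be the expected loss of hypothesis $h$ and $\hat{R}(h, Z)$ the empirical risk from a
sample $Z \in \mathcal{Z}^n$:

$$ R(h) = \mathbb{E}_{z \sim \mathcal{D}}\, \ell(h, z), \qquad \hat{R}(h, Z) = \frac{1}{n} \sum_{i=1}^{n} \ell(h, z_i). $$

The optimal risk is $R^* = \inf_{h \in \mathcal{H}} R(h)$ and we assume that it is achieved by an optimal $h^* \in \mathcal{H}$.
Similarly, the minimal empirical risk $\hat{R}^*(Z) = \inf_{h \in \mathcal{H}} \hat{R}(h, Z)$ is achieved by $\hat{h}^*(Z) \in \mathcal{H}$.
For a possibly randomized algorithm $A : \mathcal{Z}^n \to \mathcal{H}$ that learns some hypothesis $A(Z) \in \mathcal{H}$
given data sample $Z$, we say $A$ is consistent if

$$ \lim_{n \to \infty} \mathbb{E}_{Z \sim \mathcal{D}^n} \left( \mathbb{E}_{h \sim A(Z)} R(h) - R^* \right) = 0. \tag{1} $$

In addition, we say $A$ is consistent with rate $\xi(n)$ if

$$ \mathbb{E}_{Z \sim \mathcal{D}^n} \left( \mathbb{E}_{h \sim A(Z)} R(h) - R^* \right) \le \xi(n), \quad \text{where } \xi(n) \to 0 \text{ as } n \to \infty. \tag{2} $$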
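As a concrete illustration of the empirical risk and its minimizer, the following is a minimal sketch for one-dimensional threshold classifiers with 0-1 loss. The helper names (`zero_one_loss`, `empirical_risk`, `erm`, `sample`), the finite candidate grid and the synthetic data generator are assumptions introduced for this example only; they are not part of the paper.

```python
import random


def zero_one_loss(h, z):
    """0-1 loss of threshold hypothesis h on z = (x, y); h predicts 1 iff x > h."""
    x, y = z
    return float((1 if x > h else 0) != y)


def empirical_risk(h, Z, loss=zero_one_loss):
    """Empirical risk: the average loss of h over the sample Z."""
    return sum(loss(h, z) for z in Z) / len(Z)


def erm(hypotheses, Z, loss=zero_one_loss):
    """Return a minimizer of the empirical risk over a finite candidate set."""
    return min(hypotheses, key=lambda h: empirical_risk(h, Z, loss))


def sample(n, true_threshold=0.3, rng=None):
    """Hypothetical data source: x uniform on [0, 1], noiseless threshold labels."""
    rng = rng or random.Random(0)
    xs = [rng.random() for _ in range(n)]
    return [(x, 1 if x > true_threshold else 0) for x in xs]


grid = [i / 100 for i in range(101)]  # assumed discretization of [0, 1]
Z = sample(200)
h_hat = erm(grid, Z)
print(h_hat, empirical_risk(h_hat, Z))
```

Because the labels here are noiseless and the true threshold lies on the grid, the minimal empirical risk is zero; with noisy labels the same functions apply unchanged.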

Since the distribution $\mathcal{D}$ is unknown, we cannot adapt the algorithm $A$ to $\mathcal{D}$, especially
when privacy is a concern. Also, even if $A$ is pointwise consistent for any distribution
$\mathcal{D}$, it may have different rates for different $\mathcal{D}$ and potentially be arbitrarily slow for some
$\mathcal{D}$. This makes it hard to evaluate whether $A$ indeed learns the learning problem and
hinders the study of the learnability problem. In this study, we adopt the stronger notion of
learnability considered in Shalev-Shwartz et al. (2010), which is a direct generalization of
PAC-learnability (Valiant, 1984) and agnostic PAC-learnability (Kearns et al., 1992) to the
General Learning Setting as studied by Haussler (1992).

Definition 1 (Learnability, Shalev-Shwartz et al., 2010). A learning problem is learnable
if there exists an algorithm $A$ and rate $\xi(n)$ such that $A$ is consistent with rate $\xi(n)$ for
any distribution $\mathcal{D}$ defined on $\mathcal{Z}$.

This definition requires consistency to hold universally for any distribution $\mathcal{D}$ with a uniform
(distribution-independent) rate $\xi(n)$. This type of problem is often called distribution-free
learning (Valiant, 1984), and an algorithm is said to be universally consistent with rate
$\xi(n)$ if it realizes the criterion.


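| Problem | Hypothesis class $\mathcal{H}$ | $\mathcal{Z}$ or $\mathcal{X} \times \mathcal{Y}$ | Loss function $\ell$ |
|---|---|---|---|
| Binary classification | $\mathcal{H} \subset \{f : \{0,1\}^d \to \{0,1\}\}$ | $\{0,1\}^d \times \{0,1\}$ | $1(h(x) \neq y)$ |
| Regression | $\mathcal{H} \subset \{f : [0,1]^d \to \mathbb{R}\}$ | $[0,1]^d \times \mathbb{R}$ | $\lvert h(x) - y \rvert^2$ |
| Density Estimation | Bounded distributions on $\mathcal{Z}$ | $\mathcal{Z} \subset \mathbb{R}^d$ | $-\log(h(z))$ |
| K-means Clustering | $\{S \subset \mathbb{R}^d : \lvert S \rvert = k\}$ | $\mathcal{Z} \subset \mathbb{R}^d$ | $\min_{c \in h} \lVert c - z \rVert^2$ |
| RKHS classification | Bounded RKHS | RKHS $\times \{0,1\}$ | $\max\{0, 1 - y \langle x, h \rangle\}$ |
| RKHS regression | Bounded RKHS | RKHS $\times \mathbb{R}$ | $\lvert \langle x, h \rangle - y \rvert^2$ |
| Sparse PCA | Rank-$r$ projection matrices | $\mathbb{R}^d$ | $\lVert hz - z \rVert^2 + \lambda \lVert h \rVert_1$ |
| Robust PCA | All subspaces in $\mathbb{R}^d$ | $\mathbb{R}^d$ | $\lVert P_h(z) - z \rVert_1 + \lambda\, \mathrm{rank}(h)$ |
| Matrix Completion | All subspaces in $\mathbb{R}^d$ | $\mathbb{R}^d \times \{1,0\}^d$ | $\min_{b \in h} \lVert y \circ (b - x) \rVert^2 + \lambda\, \mathrm{rank}(h)$ |
| Dictionary Learning | All dictionaries $\in \mathbb{R}^{d \times r}$ | $\mathbb{R}^d$ | $\min_{b \in \mathbb{R}^r} \lVert hb - z \rVert^2 + \lambda \lVert b \rVert_1$ |
| Non-negative MF | All dictionaries $\in \mathbb{R}_+^{d \times r}$ | $\mathbb{R}^d$ | $\min_{b \in \mathbb{R}_+^r} \lVert hb - z \rVert^2$ |
| Subspace Clustering | A set of $k$ rank-$r$ subspaces | $\mathbb{R}^d$ | $\min_{b \in h} \lVert P_b(z) - z \rVert^2$ |
| Topic models (LDA) | $\{P(\text{word} \mid \text{topic})\}$ | Documents | $-\max_{b \in \{P(\text{Topic})\}} \sum_{w \in z} \log P_{b,h}(w)$ |

Table 1: An illustration of problems in the General Learning Setting.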


2.2 Differential privacy 

Differential privacy requires that if we arbitrarily perturb a database by only one data 
point, the output should not differ much. Therefore, if one conducts a statistical test 
for whether any individual is in the database or not, the false positive and false negative 
probabilities cannot both be small (Wasserman & Zhou, 2010). Formally, define the “Hamming
distance”

$$ d(Z, Z') := \#\{ i = 1, \dots, n : z_i \neq z_i' \}. \tag{3} $$

Definition 2 ($\epsilon$-Differential Privacy, Dwork, 2006). An algorithm $A$ is $\epsilon$-differentially
private if

$$ \mathbb{P}(A(Z) \in H) \le \exp(\epsilon)\, \mathbb{P}(A(Z') \in H) $$

for all $Z, Z'$ obeying $d(Z, Z') = 1$ and any measurable subset $H \subseteq \mathcal{H}$.

There are weaker notions of differential privacy. For example, $(\epsilon, \delta)$-differential privacy
allows for a small probability $\delta$ where the privacy guarantee does not hold. In this paper,
we will mainly work with the stronger $\epsilon$-differential privacy. In Section 6 we discuss the
problem of $(\epsilon, \delta)$-differential privacy and extend some of the results to this setting.
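To make the inequality in Definition 2 concrete, the sketch below checks it exactly for randomized response on a one-record database, a textbook $\epsilon$-differentially private mechanism that is not discussed in this paper and is used here purely as an illustration. The function names are hypothetical. Since the output space has only two elements, checking the ratio on each single output is enough to cover every subset $H$.

```python
import math


def randomized_response_probs(bit, epsilon):
    """Output distribution of randomized response on a single private bit.

    The true bit is released with probability e^eps / (1 + e^eps) and is
    flipped otherwise.
    """
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {bit: keep, 1 - bit: 1.0 - keep}


def max_privacy_ratio(epsilon):
    """Largest ratio P(A(Z) = o) / P(A(Z') = o) over outputs o.

    Z = (0,) and Z' = (1,) are the two neighbouring one-record databases.
    """
    p0 = randomized_response_probs(0, epsilon)
    p1 = randomized_response_probs(1, epsilon)
    return max(max(p0[o] / p1[o], p1[o] / p0[o]) for o in (0, 1))


epsilon = 0.5
# The worst-case ratio matches exp(epsilon) up to floating-point error.
print(max_privacy_ratio(epsilon), math.exp(epsilon))
```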

Our objective is to understand whether there is a gap between learnable problems and 
privately learnable problems in the general learning setting, and to quantify the tradeoff 
required to protect privacy. To achieve this objective, we need to show the existence of 
an algorithm that learns a class of problems while preserving differential privacy. More 
formally, we define 
Definition 3 (Private learnability). A learning problem is privately learnable with rate $\xi(n)$
if there exists an algorithm $A$ that satisfies both universal consistency (as in Definition 1)
with rate $\xi(n)$ and $\epsilon$-differential privacy with privacy parameter $\epsilon < \infty$.

We can view the consistency requirement in Definition 3 as a measure of utility. This utility
is not a function of the observed data, however, but rather how the results generalize to 
unseen data. 

The following lemma shows that the above definition of private learnability is actually
equivalent to a seemingly much stronger condition with a vanishing privacy loss $\epsilon$.

Lemma 4. If there is an $\epsilon$-DP algorithm that is consistent with rate $\xi(n)$ for some constant
$0 < \epsilon < \infty$, then there is a $\frac{1}{\sqrt{n}}(e^{\epsilon} - e^{-\epsilon})$-DP algorithm that is consistent with rate $\xi(\sqrt{n})$.

The proof, given in Appendix A.1, uses a subsampling theorem adapted from Beimel et al.
(2014, Lemma 4.4).

There are many approaches to designing differentially private algorithms, such as noise
perturbation using Laplace noise (Dwork, 2006; Dwork et al., 2006b) and the Exponential
Mechanism (McSherry & Talwar, 2007). Our construction of generic differentially private
learning algorithms applies the Exponential Mechanism to penalized empirical risk
minimization. Our argument will make use of a general characterization of learnability described
below.
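The idea of applying the Exponential Mechanism to empirical risk can be sketched as follows for a finite candidate set. This is the standard textbook form of the mechanism, assuming a loss bounded in $[0, 1]$ so that the utility $-n \hat{R}(h, Z)$ changes by at most 1 when one data point is replaced; it omits the penalty term and the discretization step, and it is not claimed to be the exact algorithm analyzed in Section 4. All function names are hypothetical.

```python
import math
import random


def empirical_risk(h, Z, loss):
    """Average loss of hypothesis h over the sample Z."""
    return sum(loss(h, z) for z in Z) / len(Z)


def exponential_mechanism_probs(hypotheses, Z, loss, epsilon):
    """Sampling probabilities P(h) proportional to exp(-epsilon * n * R_hat(h, Z) / 2).

    Assumes loss values lie in [0, 1], so the utility -n * R_hat has
    sensitivity 1 under replacement of a single data point.
    """
    n = len(Z)
    scores = [-epsilon * n * empirical_risk(h, Z, loss) / 2.0 for h in hypotheses]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(hypotheses, Z, loss, epsilon, rng=random):
    """Draw one hypothesis from the exponential-mechanism distribution."""
    probs = exponential_mechanism_probs(hypotheses, Z, loss, epsilon)
    return rng.choices(hypotheses, weights=probs, k=1)[0]
```

A larger $\epsilon n$ concentrates the distribution on near-minimizers of the empirical risk, which is the intuition behind treating this mechanism as an ERM-like rule.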


2.3 Stability and Asymptotic ERM 


An important breakthrough in learning theory is a full characterization of all learnable 
problems in the General Learning Setting in terms of stability and empirical risk minimization 
(Shalev-Shwartz et al., 2010). Without assuming uniform convergence of empirical risk,
Shalev-Shwartz et al. showed that a problem is learnable if and only if there exists 
a “strongly uniform-RO stable” and “always asymptotically empirical risk minimization” 
(Always AERM) randomized algorithm that learns the problem. Here “RO” stands for 
“replace one”. Also, any strongly uniform-RO stable and “universally” AERM (weaker 
than “always” AERM) learning rule learns the problem consistently. Here we give detailed 
definitions. 

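Definition 5 (Universally/Always AERM, Shalev-Shwartz et al., 2010). A (possibly
randomized) learning rule $A$ is Universally AERM if for any distribution $\mathcal{D}$ defined on domain
$\mathcal{Z}$,

$$ \mathbb{E}_{Z \sim \mathcal{D}^n} \left[ \mathbb{E}_{h \sim A(Z)} \hat{R}(h, Z) - \hat{R}^*(Z) \right] \to 0 \quad \text{as } n \to \infty, $$

where $\hat{R}^*(Z)$ is the minimum empirical risk for data set $Z$. We say $A$ is Always AERM if,
in addition,

$$ \sup_{Z \in \mathcal{Z}^n} \left[ \mathbb{E}_{h \sim A(Z)} \hat{R}(h, Z) - \hat{R}^*(Z) \right] \to 0 \quad \text{as } n \to \infty. $$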
Figure 2: A summary of the relationships of various notions revealed by our analysis.

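Definition 6 (Strongly Uniform RO-Stability, Shalev-Shwartz et al., 2010). An
algorithm $A$ is strongly uniform RO-stable if

$$ \sup_{z \in \mathcal{Z}} \; \sup_{Z, Z' \in \mathcal{Z}^n,\; d(Z, Z') = 1} \left| \mathbb{E}_{h \sim A(Z)} \ell(h, z) - \mathbb{E}_{h \sim A(Z')} \ell(h, z) \right| \to 0 \quad \text{as } n \to \infty, $$

where $d(Z, Z')$ is defined in (3); in other words, $Z$ and $Z'$ can differ by at most one data
point.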

Since we will not deal with other variants of algorithmic stability in this paper (e.g.,
hypothesis stability (Kearns & Ron, 1999), uniform stability (Bousquet & Elisseeff, 2002)
and leave-one-out (LOO) stability in Mukherjee et al. (2006)), we simply call Definition 6
stability or uniform stability. Likewise, we will refer to $\epsilon$-differential privacy as just “privacy”
although there are several other notions of privacy in the literature.


3 Characterization of private learnability 

We are now ready to state our main result. The only assumption we make is the uniform 
boundedness of the loss function. This is also assumed in Shalev-Shwartz et al. (2010) for 
the learnability problem without privacy constraints. Without loss of generality, we can
assume $0 \le \ell(h, z) \le 1$.

Theorem 7. Given a learning problem $(\mathcal{Z}, \mathcal{H}, \ell)$, the following statements are equivalent.

1. The problem is privately learnable.

2. There exists a differentially private universally AERM algorithm.

3. There exists a differentially private always AERM algorithm.

The proof is simple yet revealing. We present the arguments for $2 \Rightarrow 1$ (sufficiency
of AERM) in Section 3.1 and $1 \Rightarrow 3$ (necessity of AERM) in Section 3.2; $3 \Rightarrow 2$ follows
trivially from the definitions of “always” and “universal” AERM.

The theorem says that we can stick to ERM-like algorithms for private learning, even though
ERM may fail for some problems in the (non-private) general learning setting (Shalev-Shwartz
et al., 2010). Thus a standard procedure for finding universally consistent and
differentially private algorithms would be to approximately minimize the empirical risk
using some differentially private procedure (Chaudhuri et al., 2011; Kifer et al., 2012;
Bassily et al., 2014). If the utility analysis reveals that the method is AERM, we do not
need to worry about generalization as it is guaranteed by privacy. This consistency analysis
is considerably simpler than in non-private learning problems, where one typically needs to
control generalization error either via uniform convergence (VC-dimension, Rademacher
complexity, metric entropy, etc.) or via a stability argument (Shalev-Shwartz et al.,
2010).

This result does not imply that privacy is helping the algorithm to learn in any sense, as the 
simplicity is achieved at the cost of having a smaller class of learnable problems. A concrete 
example of a problem being learnable but not privately learnable is given in (Chaudhuri & 
Hsu, 2011) and we will revisit it in Section 3.3. For some problems where ERM fails, it 
may not be possible to make it AERM while preserving privacy. In particular, we were not 
able to privatize the problem in Section 4.1 of Shalev-Shwartz et al. (2010).

To avoid any potential misunderstanding, we stress that Theorem 7 is a characterization 
of learnability, not learning algorithms. It does not prevent the existence of a universally 
consistent learning algorithm that is private but not AERM. Also, the characterization given 
in Theorem 7 is about consistency, and it does not claim anything on sample complexity. 
An algorithm that is AERM may be suboptimal in terms of convergence rate. 


3.1 Sufficiency: Privacy implies stability 

A key ingredient in the proof of sufficiency is a well-known heuristic observation that 
differential privacy by definition implies uniform stability, which is useful in its own 
right. 

Lemma 8 (Privacy $\Rightarrow$ Stability). Assume $0 \le \ell(h, z) \le 1$. Any $\epsilon$-differentially private
algorithm satisfies $(e^{\epsilon} - 1)$-stability. Moreover, if $\epsilon \le 1$, it satisfies $2\epsilon$-stability.

The proof of this lemma comes directly from the definition of differential privacy so it is 
algorithm independent. The converse, however, is not true in general (e.g., a non-trivial 
deterministic algorithm can be stable, but not differentially private.) 
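The bound in Lemma 8 can be spot-checked numerically. The sketch below computes the exact output distributions of a simple $\epsilon$-differentially private rule on one pair of neighbouring data sets and compares the largest change in expected loss over a set of test points with $e^{\epsilon} - 1$. The private rule used here (an exponential mechanism over a finite grid of thresholds with 0-1 loss) and all names are assumptions made for illustration; the check covers one neighbouring pair and a finite set of test points, not the suprema in Definition 6.

```python
import math


def loss(h, z):
    """0-1 loss of threshold h on z = (x, y); h predicts 1 iff x > h."""
    x, y = z
    return float((1 if x > h else 0) != y)


def output_distribution(hypotheses, Z, epsilon):
    """Exact output distribution of an exponential mechanism (assumed example rule).

    Utility is minus the total loss, which has sensitivity 1 for losses in [0, 1].
    """
    scores = [-epsilon * sum(loss(h, z) for z in Z) / 2.0 for h in hypotheses]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def stability_gap(hypotheses, Z, Z_prime, test_points, epsilon):
    """Largest change in expected loss at a test point between A(Z) and A(Z')."""
    p = output_distribution(hypotheses, Z, epsilon)
    q = output_distribution(hypotheses, Z_prime, epsilon)
    gaps = []
    for z in test_points:
        expected_p = sum(pi * loss(h, z) for pi, h in zip(p, hypotheses))
        expected_q = sum(qi * loss(h, z) for qi, h in zip(q, hypotheses))
        gaps.append(abs(expected_p - expected_q))
    return max(gaps)


grid = [i / 10 for i in range(11)]
Z = [(0.1, 0), (0.4, 1), (0.7, 1), (0.9, 1)]
Z_prime = [(0.1, 0), (0.4, 1), (0.7, 1), (0.2, 1)]  # last record replaced
tests = [(x / 20, y) for x in range(21) for y in (0, 1)]
for eps in (0.1, 0.5, 1.0):
    print(eps, stability_gap(grid, Z, Z_prime, tests, eps), math.exp(eps) - 1.0)
```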
Corollary 9 (Privacy + Universal AERM $\Rightarrow$ Consistency). If a learning algorithm $A$ is
$\epsilon(n)$-differentially private and $A$ is universally AERM with rate $\xi(n)$, then $A$ is universally
consistent with rate $\xi(n) + e^{\epsilon(n)} - 1 = O(\xi(n) + \epsilon(n))$.

The proof of Corollary 9, provided in the Appendix, combines Lemma 8 and the fact that 
consistency is implied by stability and AERM (Theorem 35). Our Theorem 35 is based on 
minor modifications of Theorem 8 in Shalev-Shwartz et al. (2010). In fact, Corollary 9 can 
be stated in a stronger per distribution form, since universality is not used in the proof. We 
will revisit this point when we discuss a weaker notion of private learnability below. 

Lemma 4 and Corollary 9 together establish $2 \Rightarrow 1$ in Theorem 7.

If for a problem privacy and always AERM cannot coexist, then the problem is not privately 
learnable. This is what we will show next. 

3.2 Necessity: Consistency implies Always AERM 

To prove that the existence of an always AERM learning algorithm is necessary for any
privately learnable problem, it suffices to construct such a learning algorithm, for each
learnable problem, from any universally consistent learning algorithm.

Lemma 10 (Consistency + Privacy $\Rightarrow$ Private Always AERM). If $A$ is a universally
consistent learning algorithm satisfying $\epsilon$-DP with any $\epsilon > 0$ and consistent with rate $\xi(n)$,
then there is another universally consistent learning algorithm $A'$ that is always AERM
with rate $\xi(\sqrt{n})$ and satisfies $\frac{1}{\sqrt{n}}(e^{\epsilon} - e^{-\epsilon})$-DP.

Lemma 10 is proved in Appendix A.2. The proof idea is to run $A$ on a size-$O(\sqrt{n})$ random
subsample of $Z$, which will be universally consistent with a slower rate, differentially private
with $\epsilon(n) \to 0$ (Lemma 34), and at the same time always AERM. The last part uses an
argument in Lemma 24 of Shalev-Shwartz et al. (2010) which appeals to the universality of
$A$’s consistency on a specific discrete distribution supported on the given data set $Z$.
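The subsample-then-run construction can be sketched as below. The wrapper `subsample_and_run` and the helper `amplified_epsilon` are hypothetical names. The privacy bound coded here is the standard amplification-by-subsampling bound for sampling $m$ of $n$ points without replacement, $\log(1 + \tfrac{m}{n}(e^{\epsilon} - 1))$; it is an assumption substituted for illustration and is not the constant derived in the paper's Lemma 34. It only shows that the privacy parameter shrinks roughly like $1/\sqrt{n}$ when $m = \lfloor \sqrt{n} \rfloor$.

```python
import math
import random


def subsample_and_run(algorithm, Z, rng=random):
    """Run `algorithm` on a uniformly random subsample of size floor(sqrt(n))."""
    m = max(1, math.isqrt(len(Z)))
    return algorithm(rng.sample(Z, m))


def amplified_epsilon(epsilon, n):
    """Privacy parameter of the subsampled rule under the standard bound.

    Assumes an epsilon-DP base algorithm and sampling m = floor(sqrt(n))
    points out of n without replacement.
    """
    gamma = max(1, math.isqrt(n)) / n
    return math.log(1.0 + gamma * (math.exp(epsilon) - 1.0))


for n in (100, 10_000, 1_000_000):
    print(n, amplified_epsilon(1.0, n))
```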

As pointed out by an anonymous reviewer, there is a simpler proof by invoking Theorem 10
of Shalev-Shwartz et al. (2010), which says that any consistent and generalizing algorithm must
be AERM, and a result (e.g., Bassily et al., 2014, Appendix F) that says “privacy $\Rightarrow$
generalization”. This is a valid observation. But their Theorem 10 is proven using a detour
through “generalization”, which leads to a slower rate than what we are able to obtain in
Lemma 10 using a more direct argument.
3.3 Private Learnability vs. Non-private Learnability

Now that we have a characterization of all privately learnable problems, a natural question
is whether every learnable problem is also privately learnable. The answer is “yes”
for learning in the Statistical Query (SQ) model and the PAC learning model (binary classification)
with a finite hypothesis space, and “no” for a continuous hypothesis space (Chaudhuri &
Hsu, 2011).

By definition, all privately learnable problems are learnable. But now that we know that 
privacy implies generalization, it is tempting to hope that privacy can help at least some 
problem to learn better than any non-private algorithm. In terms of learnability, the question 
becomes: Could there be a (learnable) problem that is exclusively learnable through private 
algorithms? We now show that such a problem does not exist. 

Proposition 11. If a learning problem is learnable by an $\epsilon$-DP algorithm $A$, then it is also
learnable by a non-private algorithm.

The proof is given in Appendix A.3. The idea is that $A(Z)$ defines a distribution over $\mathcal{H}$.
Pick a $z \in \mathcal{Z}$. If $z \notin Z$, algorithm $A' = A$. Otherwise, $A'(Z)$ samples from a slightly
different distribution than $A(Z)$ that does not affect the expectation much.

On the other hand, not all learnable problems are privately learnable. This can already be 
seen from Chaudhuri & Hsu (2011), where the gap between learning and private learning 
is established. We revisit Chaudhuri & Hsu’s example in our notation under the general
learning setting and produce an alternative proof by showing that differential privacy 
contradicts always AERM, then invoking Theorem 7 to show the problem is not privately 
learnable. 

Proposition 12 (Chaudhuri & Hsu, 2011, Theorem 5). There exists a problem that is
learnable by a non-private algorithm, but not privately learnable. In particular, any private 
algorithm cannot be always AERM in this problem. 

We describe the counterexample and re-establish the impossibility of private learning for 
this problem using the contrapositive of Theorem 7, which suggests that if privacy and 
always AERM algorithm cannot coexist for some problem, then the problem is not privately 
learnable. 

Consider the binary classification problem with X = [0,1], y = {0,1} and 0-1 loss function. 
Let TL be the collection of threshold functions that output h{x) = 1 \i x > h and h{x) = 0 
otherwise. This class has VC-dimension 1, and hence the problem is learnable. 

Next we construct $K = \lceil \exp(\epsilon n) \rceil$ data sets such that if $K - 1$ of them obey AERM, the remaining one cannot. Let $\eta = 1/\exp(\epsilon n)$, $K := \lceil 1/\eta \rceil$. Let $h_1, h_2, \ldots, h_K$ be thresholds that are at least $\eta$ apart, so that the intervals $[h_i - \eta/3, h_i + \eta/3]$ are disjoint. If we take $Z_i \subset [h_i - \eta/3, h_i + \eta/3]$ with half of the points in $[h_i - \eta/3, h_i)$ and the other half in $(h_i, h_i + \eta/3]$, and label each data point $z$ in it with $\mathbb{1}(z > h_i)$, then the empirical risk $\hat{R}(h_i, Z_i) = 0$ for all $i = 1, \ldots, K$. So for any AERM learning rule, $\mathbb{E}_{h \sim \mathcal{A}(Z_i)} \hat{R}(h, Z_i) \to 0$ for all $i$.

For some sufficiently large $n$, $\mathbb{E}_{h \sim \mathcal{A}(Z_i)} \hat{R}(h, Z_i) < 0.1$.

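Now consider $Z_1$,

$$\mathbb{P}(\mathcal{A}(Z_1) \notin [h_1 - \eta/3, h_1 + \eta/3]) \ge \sum_{i=2}^{K} \mathbb{P}(\mathcal{A}(Z_1) \in [h_i - \eta/3, h_i + \eta/3]),$$

since these intervals are disjoint. Then by the definition of $\epsilon$-DP,

$$\mathbb{P}(\mathcal{A}(Z_1) \in [h_i - \eta/3, h_i + \eta/3]) \ge \exp(-\epsilon n)\, \mathbb{P}(\mathcal{A}(Z_i) \in [h_i - \eta/3, h_i + \eta/3]). \qquad (4)$$

It follows that $\mathbb{P}(\mathcal{A}(Z_i) \in [h_i - \eta/3, h_i + \eta/3]) > 0.9$, otherwise $\mathbb{E}_{h \sim \mathcal{A}(Z_i)} \hat{R}(h, Z_i) > 0.1$; therefore

$$\mathbb{P}(\mathcal{A}(Z_1) \notin [h_1 - \eta/3, h_1 + \eta/3]) \ge K \exp(-\epsilon n) \cdot 0.9 \ge 0.9, \qquad (5)$$

and $\mathbb{E}_{h \sim \mathcal{A}(Z_1)} \hat{R}(h, Z_1) \ge 0.9 \times 1 = 0.9$, which violates the “always AERM” condition that requires $\mathbb{E}_{h \sim \mathcal{A}(Z_1)} \hat{R}(h, Z_1) < 0.1$. Therefore, the problem is not privately learnable.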
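
The construction above can be checked numerically on a small instance. The following is a minimal sketch, not part of the original argument: the function names are hypothetical, and it assumes the thresholds are spaced $1/K$ apart (rather than $\eta$) so that all $K$ intervals fit inside $[0,1]$. It verifies that each $h_i$ has zero empirical risk on its own data set $Z_i$ and errs on half of the points of every other data set, and it evaluates the right-hand side of (5).

```python
import math


def empirical_risk(h, Z):
    """0-1 empirical risk of the threshold classifier 1(x > h) on Z = [(x, y), ...]."""
    return sum(1 for x, y in Z if (1 if x > h else 0) != y) / len(Z)


def build_instance(eps, n):
    """Build thresholds h_1..h_K and data sets Z_1..Z_K with K = ceil(exp(eps * n)).

    Assumption (ours): thresholds are spaced s = 1/K apart so that every
    interval [h_i - s/3, h_i + s/3] lies inside [0, 1]. n must be even.
    """
    K = math.ceil(math.exp(eps * n))
    s = 1.0 / K
    half = n // 2
    thresholds = [(i + 0.5) * s for i in range(K)]
    datasets = []
    for h in thresholds:
        # half of the points in [h - s/3, h), labelled 0 = 1(x > h)
        left = [(h - (s / 3) * (j + 1) / half, 0) for j in range(half)]
        # the other half in (h, h + s/3], labelled 1
        right = [(h + (s / 3) * (j + 1) / half, 1) for j in range(half)]
        datasets.append(left + right)
    return thresholds, datasets


eps, n = 0.5, 8
thresholds, datasets = build_instance(eps, n)
K = len(thresholds)
own_risk = [empirical_risk(thresholds[i], datasets[i]) for i in range(K)]
cross_risk = [empirical_risk(thresholds[j], datasets[i])
              for i in range(K) for j in range(K) if i != j]
# Right-hand side of (5): K * exp(-eps * n) * 0.9
lower_bound = K * math.exp(-eps * n) * 0.9
```

Any hypothesis that lands outside the interval around $h_i$ misclassifies at least half of $Z_i$ in this sketch, which is the mechanism that forces a private algorithm to fail on at least one of the $K$ data sets.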

As pointed out by an anonymous reviewer, the same impossibility of privately learning thresholds on $[0,1]$ can be obtained through the characterization of the sample complexity (Beimel et al., 2013a), via a bound that depends on $\log(|\mathcal{H}|)$, which is infinite on $[0,1]$. The above analysis provides different insights about the problem. We will use it again to understand the separation of learnability and learnability under $(\epsilon, \delta)$-differential privacy in Section 6.


3.4 Private $\mathfrak{D}$-learnability

The above example implies that even very simple learning problems may not be privately 
learnable. To fix this caveat, note that most data sets of practical interest have nice 
distributions. Therefore, it makes sense to consider a smaller class of distributions, e.g., 
smooth distributions that have bounded kth order derivative, or those having bounded 
total variation. These are common assumptions in non-parametric statistics, such as kernel 
density estimation, smoothing spline regression and mode clustering. Similarly, in high 
dimensional statistics, there are often assumptions on the structures of the underlying 
distribution, such as sparsity, smoothness, and low-rank conditions. 

Definition 13 ((Private) $\mathfrak{D}$-learnability). We say a learning problem $(\mathcal{Z}, \mathcal{H}, \ell)$ is $\mathfrak{D}$-learnable if there exists a learning algorithm $\mathcal{A}$ that is consistent for every unknown distribution $\mathcal{D} \in \mathfrak{D}$. If, in addition, the problem is $\mathfrak{D}$-learnable under $\epsilon$-differential privacy for some $0 \le \epsilon < \infty$, then we say the problem is privately $\mathfrak{D}$-learnable.




Almost all of our arguments hold in a per-distribution fashion; therefore they also hold for any such subclass $\mathfrak{D}$. The only exception is the necessity of “always AERM” (Lemma 10), where the proof uses universal consistency on an arbitrary discrete uniform distribution. The characterization still holds if the class contains all finite discrete uniform distributions. For general distribution classes, we characterize private $\mathfrak{D}$-learnability using a weaker “universally AERM” (instead of “always AERM”) under the assumption that the problem itself is learnable in a distribution-free setting without privacy constraints.

Lemma 14 (private $\mathfrak{D}$-learnability $\Rightarrow$ private $\mathfrak{D}$-universal AERM). If an $\epsilon$-DP algorithm $\mathcal{A}$ is $\mathfrak{D}$-universally consistent with rate $\xi(n)$ and the problem itself is learnable in a distribution-free sense with rate $\xi'(n)$, then there exists a $\mathfrak{D}$-universally consistent learning algorithm $\mathcal{A}'$
that is T)-universally AERM with rate 12^'(n^/"^)-|-;^-|-^(yTi) and satisfies -^{e'^ — e~^)-DP. 

The proof, given in Appendix A.4, shows that the algorithm $\mathcal{A}'$ that applies $\mathcal{A}$ to a random subsample of size $\lfloor \sqrt{n} \rfloor$ is AERM for any distribution in the class $\mathfrak{D}$.

Theorem 15 (Characterization of private $\mathfrak{D}$-learnability). A problem is privately $\mathfrak{D}$-learnable if there exists an algorithm that is $\mathfrak{D}$-universally AERM and differentially private with privacy loss $\epsilon(n) \to 0$. If, in addition, the problem is (distribution-free and non-privately) learnable, then the converse is also true.

Proof. The “if” part is exactly the same as the argument in Section 3.1, since both Lemma 8 and Lemma 9 hold for each distribution independently. Under the additional assumption that the problem itself is learnable (distribution-free and non-privately), the “only if” part is given by Lemma 14. ■

This result may appear unsatisfactory because of the additional learnability assumption. It is indeed a strong assumption, because many problems that are $\mathfrak{D}$-learnable for a practically meaningful $\mathfrak{D}$ are not actually learnable. We provide one such example here.

Example 16. Let the data space be $[0,1]$, the hypothesis space be the class of all finite subsets of $[0,1]$ and the loss function be $\ell(h, z) = \mathbb{1}_{z \notin h}$. This problem is not learnable, and not even $\mathfrak{D}$-learnable when $\mathfrak{D}$ is the class of all discrete distributions with a finite number of possible values. But it is $\mathfrak{D}$-learnable when $\mathfrak{D}$ is further restricted by an upper bound on the total number of possible values.

Proof. For any discrete distribution with a finite support set, there is an $h \in \mathcal{H}$ such that the optimal risk is 0. Assume the problem is learnable with rate $\xi(n)$; then $\xi(n) < 0.5$ for some $n$. However, we can always construct a uniform distribution over $3n$ elements, and it is information-theoretically impossible for any estimator based on $n$ samples from the distribution to achieve a risk better than $2/3$. The problem is therefore not learnable. When we assume an upper bound $N$ on the maximum number of bins of the underlying distribution, the ERM that outputs just the support of all observed data is universally consistent with rate $\xi(n) = N/n$. ■
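
The two regimes in this proof can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the support points are relabelled as integers instead of points of $[0,1]$, the function names are hypothetical, and only the rule that outputs the observed support is evaluated (the lower bound in the proof holds for any estimator, which a simulation cannot show).

```python
import random


def support_erm(sample):
    """ERM-style rule from the proof: output the set of observed values."""
    return set(sample)


def true_risk(h, support_size):
    """Risk of h under the uniform distribution on {0, ..., support_size - 1}
    with loss 1(z not in h): the probability mass outside h."""
    return 1.0 - len(h & set(range(support_size))) / support_size


rng = random.Random(0)

# Unbounded support: n samples from a uniform distribution over 3n values.
n = 50
sample = [rng.randrange(3 * n) for _ in range(n)]
risk_large_support = true_risk(support_erm(sample), 3 * n)

# Bounded support: at most N = 5 values, many samples.
N = 5
sample = [rng.randrange(N) for _ in range(2000)]
risk_bounded_support = true_risk(support_erm(sample), N)
```

With $n$ samples the rule can cover at most $n$ of the $3n$ support points, so its risk is at least $2/3$; once the support size is bounded and $n$ is large, every value is observed and the risk vanishes.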


It turns out that we cannot hope to completely remove the assumption from Theorem 15. 
The following example illustrates that some form of qualification (implied by the learnability 
assumption) is necessary for the converse statement to be true. 

Example 17. Consider the learning problem in Example 16. Let $\mathfrak{D}$ be the class of all continuous distributions. This learning problem is privately $\mathfrak{D}$-learnable, but no private AERM algorithm exists.

Proof. Let the learning problem be that in Example 16 and $\mathfrak{D}$ be the class of all continuous distributions defined on $[0,1]$. Consider the learning algorithm $\mathcal{A}(Z)$ that always returns $h = \emptyset$. The optimal risk for any continuous distribution is 1 because any finite subset has measure 0, so the output $\emptyset$ is 0-consistent and 0-generalizing, but not AERM, since the minimum empirical risk is 0. $\mathcal{A}$ is also 0-differentially private; therefore the problem is privately $\mathfrak{D}$-learnable for $\mathfrak{D}$ being the set of all continuous distributions.

However, it is not privately $\mathfrak{D}$-learnable via an AERM, i.e., no private AERM algorithm exists for this problem. We prove this by contradiction. Assume an $\epsilon$-DP AERM algorithm exists; the subsampling lemma then ensures the existence of an $\epsilon(n)$-DP AERM algorithm $\mathcal{A}'$ with $\epsilon(n) \to 0$. $\mathcal{A}'$ is therefore generalizing by stability, and it follows that $\mathcal{A}'$ has risk converging to 0. But there is no $h \in \mathcal{H}$ such that $R(h) < 1$, giving the contradiction. ■

Interestingly, this problem is $\mathfrak{D}$-learnable via a non-private AERM algorithm, which always outputs $h = Z$. This is 0-consistent and 0-AERM but not generalizing. This example suggests that $\mathfrak{D}$-learnability and learnability are quite different, because for learnable problems, if an algorithm is consistent and AERM, then it must also be generalizing (Shalev-Shwartz et al., 2010, Theorem 10).


3.5 A generic learning algorithm 

The characterization of private learnability suggests a generic (but impractical) procedure that learns all privately learnable problems (in the same flavor as the generic algorithm in Shalev-Shwartz et al. (2010) that learns all learnable problems). This is to solve

$$\operatorname*{argmin}_{(\mathcal{A}, \epsilon):\ \mathcal{A} \text{ is } \epsilon\text{-DP}} \left[ \epsilon + \sup_{Z \in \mathcal{Z}^n} \left( \mathbb{E}_{h \sim \mathcal{A}(Z)} \hat{R}(h, Z) - \inf_{h \in \mathcal{H}} \hat{R}(h, Z) \right) \right], \qquad (6)$$

or to privately $\mathfrak{D}$-learn the problem when (6) is not feasible

$$\operatorname*{argmin}_{(\mathcal{A}, \epsilon):\ \mathcal{A} \text{ is } \epsilon\text{-DP}} \left[ \epsilon + \sup_{\mathcal{D} \in \mathfrak{D}} \mathbb{E}_{Z \sim \mathcal{D}^n} \left( \mathbb{E}_{h \sim \mathcal{A}(Z)} \hat{R}(h, Z) - \inf_{h \in \mathcal{H}} \hat{R}(h, Z) \right) \right]. \qquad (7)$$

Theorem 18. Assume the problem is learnable. If the problem is privately learnable, (6) will always output a universally consistent private learning algorithm. If the problem is privately $\mathfrak{D}$-learnable, (7) will always output a $\mathfrak{D}$-universally consistent private learning algorithm.

Proof. If the problem is privately learnable, by Theorem 7 there exists an algorithm $\mathcal{A}$ that is $\epsilon(n)$-DP and always AERM with rate $\xi(n)$ and $\epsilon(n) + \xi(n) \to 0$. This $\mathcal{A}$ is a witness in the optimization, so any minimizer of (6) has an objective value no greater than $\epsilon(n) + \xi(n)$ for any $n$. Corollary 9 then gives its universal consistency. The second claim follows from the characterization of private $\mathfrak{D}$-learnability in Theorem 15. ■

It is of course impossible to minimize the supremum over any data $Z$, nor is it possible to efficiently search over the space of all algorithms, let alone DP algorithms. But conceptually, this formulation may be of interest for theoretical questions related to the search for private learning algorithms and the fundamental limits of machine learning under privacy constraints.

4 Private learning for penalized ERM

Now we describe a generic and practical class of private learning algorithms, based on the idea of minimizing the empirical risk under privacy constraint:

$$\min_{h \in \mathcal{H}} \ F(Z, h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h, z_i) + g_n(h). \qquad (8)$$

The first term is the empirical risk and the second term vanishes as $n$ increases, so that this estimator is asymptotically ERM. The same formulation has been studied before in the context of differentially private machine learning (Chaudhuri et al., 2011; Kifer et al., 2012), but our focus is more generic and does not require the objective function to be convex, differentiable, continuous, or even to have a finite-dimensional Euclidean space embedding; hence it covers a larger class of learning problems.

Our generic algorithm for differentially private learning is summarized in Algorithm 1. It applies the exponential mechanism (McSherry & Talwar, 2007) to penalized ERM. We note that this algorithm implicitly requires that $\int_{\mathcal{H}} \exp\left(\frac{\epsilon(n)}{2\Delta q} q(h, Z)\right) dh < \infty$; otherwise the distribution is not well-defined and it does not make sense to talk about differential privacy. In general, if $\mathcal{H}$ is a compact set with a finite volume (with respect to a base measure, such as the Lebesgue measure or counting measure), then such a distribution always exists. We will revisit this point and discuss the practicality of this assumption in Section 5.3.

Algorithm 1 Exponential Mechanism for regularized ERM

Input: Data points $Z = \{z_1, \ldots, z_n\} \in \mathcal{Z}^n$, loss function $\ell$, regularizer $g_n$, privacy parameter $\epsilon(n)$ and a hypothesis space $\mathcal{H}$.

1. Construct utility function $q(h, Z) := -\frac{1}{n} \sum_{i=1}^{n} \ell(h, z_i) - g_n(h)$, and its sensitivity
$\Delta q := \sup_{h \in \mathcal{H},\, d(Z, Z') = 1} |q(h, Z) - q(h, Z')| \le \frac{2}{n} \sup_{h \in \mathcal{H},\, z \in \mathcal{Z}} |\ell(h, z)|$.

2. Sample $h \in \mathcal{H}$ with probability $P(h) \propto \exp\left(\frac{\epsilon(n)}{2\Delta q} q(h, Z)\right)$.

Output: $h$.
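
For a finite hypothesis set, the two steps of Algorithm 1 can be sketched in a few lines. The following is an illustrative sketch rather than the authors' implementation: the function names, the grid of thresholds, and the toy data are hypothetical, the exponent is assumed to take the standard exponential-mechanism form $\frac{\epsilon}{2\Delta q} q(h, Z)$, and the sensitivity is passed in as an upper bound ($2/n$ for a loss bounded in $[0,1]$).

```python
import math
import random


def utility(h, Z, loss, g_n):
    """q(h, Z) = -(1/n) * sum_i loss(h, z_i) - g_n(h), as in step 1."""
    return -sum(loss(h, z) for z in Z) / len(Z) - g_n(h)


def exp_mech_probs(hypotheses, Z, loss, g_n, eps, sensitivity):
    """Sampling probabilities P(h) proportional to exp(eps / (2 * sensitivity) * q(h, Z))."""
    scores = [eps / (2.0 * sensitivity) * utility(h, Z, loss, g_n)
              for h in hypotheses]
    top = max(scores)  # shift by the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exp_mech_sample(hypotheses, probs, rng):
    """Step 2: draw one hypothesis according to probs."""
    return rng.choices(hypotheses, weights=probs, k=1)[0]


# Toy problem (hypothetical): thresholds on a grid, 0-1 loss, no regularizer.
def zero_one(h, z):
    x, y = z
    return float((1 if x > h else 0) != y)


H = [k / 10 for k in range(11)]
Z = [(0.1, 0), (0.2, 0), (0.35, 0), (0.6, 1), (0.7, 1), (0.9, 1)]
Z_adj = Z[:2] + [(0.35, 1)] + Z[3:]   # differs from Z in one data point
eps = 1.0
sens = 2.0 / len(Z)                   # upper bound on the sensitivity for a loss in [0, 1]

p = exp_mech_probs(H, Z, zero_one, lambda h: 0.0, eps, sens)
p_adj = exp_mech_probs(H, Z_adj, zero_one, lambda h: 0.0, eps, sens)
h_private = exp_mech_sample(H, p, random.Random(0))
```

Comparing `p` and `p_adj` entry by entry gives a direct check of the $\epsilon$-DP inequality on this pair of adjacent data sets: no probability changes by more than a factor of $e^{\epsilon}$.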

Using the characterization results developed so far, we are able to give sufficient conditions for consistency of private learning algorithms without having to establish uniform convergence. Define the sublevel set as

$$S_{Z,t} = \{ h \in \mathcal{H} \mid F(Z, h) \le t + \inf_{h \in \mathcal{H}} F(Z, h) \}, \qquad (9)$$

where $F(Z, h)$ is the regularized empirical risk function defined in (8). In particular, we assume the following conditions:

A1. Bounded loss function: $0 \le \ell(h, z) \le 1$ for any $h \in \mathcal{H}$, $z \in \mathcal{Z}$.

A2. Sublevel set condition: There exist a constant positive integer $n_0$, a positive real number $t_0$, and a sequence of regularizers $g_n$ satisfying $\sup_{h \in \mathcal{H}} |g_n(h)| = o(n)$, such that for any $0 < t < t_0$, $n > n_0$

(10)

where $K = K(n)$, $\rho = \rho(n)$ satisfy $\log K + \rho \log n = o(n)$. Here the measure $\mu$ may depend on context, such as the Lebesgue measure ($\mathcal{H}$ is continuous) or the counting measure ($\mathcal{H}$ is discrete).

The first condition of boundedness is common. It is assumed in Vapnik’s characterization of ERM learnability and in Shalev-Shwartz et al.’s general characterization of all learnable problems. In fact, we can always consider $\mathcal{H}$ to be a sublevel set such that the boundedness condition holds. For the second condition, the intuition is that we require the sublevel set to be large enough that the sampling procedure returns a good hypothesis with large probability. $\mu(S_t)$ is a critical parameter in the utility guarantee for the exponential mechanism (McSherry & Talwar, 2007). Also, it is worth pointing out that A2 implies that the exponential distribution is well-defined.

Figure 3: Illustration of Theorem 19: conditions for private learnability in the general learning setting. (Figure note: the dashed box works for any DP algorithm, including the exponential mechanism.)


Theorem 19 (General private learning). Let $(\mathcal{Z}, \mathcal{H}, \ell)$ be any problem in the general learning setting. Suppose we can choose $g_n$ such that A1 and A2 are satisfied with $(\rho, K, g_n, n_0, t_0)$ for a distribution $\mathcal{D}$; then Algorithm 1 satisfies $\epsilon(n)$-privacy and is consistent with rate

$$\xi(n) = \frac{9[\log K + (\rho + 2) \log n]}{n \epsilon(n)} + 2\epsilon(n) + \sup_{h \in \mathcal{H}} |g_n(h)|. \qquad (11)$$

In particular, if $\epsilon(n) = o(1)$, $\sup_{h \in \mathcal{H}} |g_n(h)| = o(1)$ and $\log K + \rho \log n = o(n \epsilon(n))$ for all $\mathcal{D}$ (in $\mathfrak{D}$), Algorithm 1 privately learns ($\mathfrak{D}$-learns) the problem.

We give an illustration of the proof in Figure 3. The detailed proof, based on the stability argument (Shalev-Shwartz et al., 2010), is deferred to Appendix A.5.

Theorem 19 covers a large number of problems in the general learning setting. Below we provide concrete examples that satisfy A1 and A2, for both privately learnable and privately $\mathfrak{D}$-learnable problems that can be learned using Algorithm 1.


4.1 Examples of privately learnable problems 

We start from a few cases where Algorithm 1 is universally consistent for all distributions.

Example 20 (Finite discrete $\mathcal{H}$). Suppose $\mathcal{H}$ can be fully encoded by $M$ bits; then

$$\mu(S_t)/\mu(\mathcal{H}) \ge |\mathcal{H}|^{-1} = 2^{-M},$$

since there is at least one optimal hypothesis for each function and $\mu$ is now the counting measure. In other words, we can take $K = 2^M$ and $\rho = 0$ in (11). Plugging this into the expression and taking $g_n \equiv 0$, $\epsilon(n) = \sqrt{(M + \log n)/n}$, we get a rate of consistency $\xi(n) = O\left(\sqrt{(M + \log n)/n}\right)$. In addition, if we can find a data-independent covering set for a continuous space, then we can discretize the space and the same results follow. This observation will be used in the construction of many private learning algorithms below.
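
The plug-in computation for a finite class can be carried out numerically. The sketch below assumes that the rate in (11) reads $\xi(n) = 9[\log K + (\rho + 2)\log n]/(n\epsilon(n)) + 2\epsilon(n) + \sup_h |g_n(h)|$; the function names are hypothetical.

```python
import math


def consistency_rate(log_K, rho, n, eps, sup_g=0.0):
    """Rate (11): 9 * [log K + (rho + 2) * log n] / (n * eps) + 2 * eps + sup|g_n|."""
    return 9.0 * (log_K + (rho + 2) * math.log(n)) / (n * eps) + 2.0 * eps + sup_g


def finite_class_rate(M, n):
    """Example 20: K = 2^M, rho = 0, g_n = 0, eps(n) = sqrt((M + log n) / n)."""
    eps = math.sqrt((M + math.log(n)) / n)
    return eps, consistency_rate(M * math.log(2), 0, n, eps)


eps_small, rate_small = finite_class_rate(M=10, n=10**3)
eps_large, rate_large = finite_class_rate(M=10, n=10**6)
```

With this choice of $\epsilon(n)$ both terms of the rate are of order $\epsilon(n)$, so the rate is at most a constant multiple of $\sqrt{(M + \log n)/n}$ and decreases as $n$ grows.
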
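
Example 21 (Lipschitz functions/Hölder class). Let $\mathcal{H}$ be a compact, $\beta_p$-regular subset of $\mathbb{R}^d$ satisfying $\mu(B \cap \mathcal{H}) \ge \beta_p \mu(B)$ for any $\ell_p$ ball $B \subset \mathbb{R}^d$ that is small enough. Assume that $F(Z, \cdot)$ is $L$-Lipschitz on $\mathcal{H}$: for any $h, h' \in \mathcal{H}$,

$$|F(Z, h) - F(Z, h')| \le L \|h - h'\|_p.$$

Then for sufficiently small $t$, we have Lebesgue measure

$$\mu(S_t) \ge \beta_p (t/L)^d$$

and Condition A2 holds with $K = \mu(\mathcal{H}) \beta_p^{-1} L^d$, $\rho = d$. Furthermore, if we take $\epsilon(n) = \sqrt{\frac{d(\log L + \log n) + \log(\mu(\mathcal{H})/\beta_p)}{n}}$, the algorithm is $O\left(\epsilon(n) + \sup_{h \in \mathcal{H}} |g_n(h)|\right)$-consistent.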

This shows that condition A2 holds for a large class of low-dimensional problems of interest in machine learning, and one can learn the problem privately without actually needing to find a covering set algorithmically. Specifically, the example includes many practically used methods such as logistic regression, linear SVM, ridge regression, and even multi-layer neural networks, since the loss functions in these methods are jointly bounded in $(Z, h)$ and Lipschitz in $h$.

The example also raises an interesting observation: while differentially private classification is not possible in a distribution-free setting for the 0-1 loss function (Chaudhuri & Hsu, 2011), the problem is learnable under a smoother surrogate loss, e.g., logistic loss or hinge loss. In other words, private learnability and computational tractability both benefit from the same relaxation.

The Lipschitz condition still requires the dimension of the hypothesis space to be $o(n)$. Thus it does not cover high-dimensional machine learning problems where $d \gg n$, nor does it contain the example of Shalev-Shwartz et al. (2010) in which ERM fails.

For high dimensional problems where d grows with n, typically some assumptions or 
restrictions need to be made either on the data or on the hypothesis space (so that it 
becomes essentially low-dimensional). We give one example here for the problem of sparse 
regression. 

Example 22 (Best subset selection). Consider $\mathcal{H} = \{h \in \mathbb{R}^d : \|h\|_0 \le s, \|h\|_2 \le 1\}$ and let $\ell(h, z)$ be an $L$-Lipschitz loss function. The solution can only be chosen from $\binom{d}{s} < d^s$ different $s$-dimensional subspaces. We can apply Algorithm 1 twice: first to sample a support set $S$ with utility function $-\min_{h \in \mathcal{H}_S} F(Z, h)$, and then to sample a solution in the chosen $s$-dimensional subspace. By the composition theorem this two-stage procedure is differentially private. Moreover, by the arguments in Example 20 and Example 21
respectively, we have an > ( 3 ) for the subset selection and p.{St) > (j^)^ for the 

low-dimensional regression. Note that p = 0 in both cases and the dependency on the 
ambient dimension d is on the logarithm. The first stage ensures that for the chosen 
support set S, min/ig-^g F{Z, h) is close to F{Z, h) by expectation 

and ( the second stage ensures that the sampled hypothesis from TLs would have objective 
function close to min/jg-^g F{Z, h) by O ff ^ _ This leads to an overall 

rate of consistency (they simply add up) of O( ^+0+^os(pCHs)/hp) ^ choose 

e(n) = 1/ffn. 

4.2 Examples of privately $\mathfrak{D}$-learnable problems

For problems where private learnability is impossible to achieve, we may still apply Theorem 19 to prove the weaker private $\mathfrak{D}$-learnability for some specific class of distributions.

Example 23 (Finite Representation Dimension in the General Learning Setting). For binary classification problems with 0-1 loss (PAC learning), this has been well-studied. In particular, Beimel et al. (2013a) characterized the sample complexity of privately learnable problems using a combinatorial condition they call a “Probabilistic Representation”, which basically involves finding a finite, data-independent set of hypotheses to approximate any hypothesis in the class. Their claim is that if the “representation dimension” is finite, then the problem is privately learnable; otherwise it is not. We can extend the notion of probabilistic representation beyond the finite discrete and countably infinite hypothesis classes considered in Beimel et al. (2013a) to cases when the problem is not privately learnable (e.g., learning threshold functions on $[0,1]$). The existence of a probabilistic representation for all distributions in $\mathfrak{D}$ would lead to a $\mathfrak{D}$-universally private learning algorithm.

Another way to define a class of distributions is to assume the existence of a reference distribution that is close to any distribution of interest, as in Chaudhuri & Hsu (2011).

Example 24 (Existence of a public reference distribution). To deal with 0-1 loss classification problems on a continuous hypothesis domain, Chaudhuri & Hsu (2011) assume that there exists a data-independent reference distribution $\mathcal{D}^*$ which, after multiplying its density by a fixed constant, uniformly dominates any distribution of interest. This essentially produces a subset of distributions $\mathfrak{D}$. The consequence is that one can build an $\epsilon$-net of $\mathcal{H}$ with the metric defined on the risk under $\mathcal{D}^*$, and this will also be a (looser) covering set for any distribution $\mathcal{D} \in \mathfrak{D}$, thereby learning the problem for any distribution in the set.

The same idea can be applied to the general learning setting. For any fixed reference distribution $\mathcal{D}^*$ defined on $\mathcal{Z}$ and constant $c$,

$$\mathfrak{D} = \{ \mathcal{D} = (\mathcal{Z}, \mathcal{F}, P) \mid P_{\mathcal{D}}(z \in A) \le c\, P_{\mathcal{D}^*}(z \in A) \text{ for } \forall A \in \mathcal{F} \}$$

is a valid set of distributions, and we are able to $\mathfrak{D}$-privately learn this problem whenever we can construct a sufficiently small cover set with respect to $\mathcal{D}^*$ and reduce the problem to Example 20. This class of problems includes high-dimensional and infinite-dimensional problems such as density estimation, nonparametric regression, kernel methods and essentially any other problems that are strictly learnable (Vapnik, 1998), since they are characterized by one-sided uniform convergence (and the corresponding entropy condition).

4.3 Discussion on uniform convergence and private learnability

Uniform convergence requires that $\sup_{h \in \mathcal{H}} |\hat{R}(h, Z) - R(h)| \to 0$ for any distribution $\mathcal{D}$ with a distribution-independent rate. Most machine learning algorithms rely on uniform convergence to establish consistency results (e.g., through complexity measures such as VC-dimension, Rademacher complexity, covering and bracketing numbers and so on). In fact, the learnability of the ERM algorithm is characterized by one-sided uniform convergence (Vapnik, 1998), which is only slightly weaker than requiring uniform convergence on both sides.

A key point in Shalev-Shwartz et al. (2010) is that the learnability (by any algorithm) 
in general learning setting is no longer characterized by variants of uniform convergence. 
However, the class of privately learnable problems is much smaller. Clearly, uniform 
convergence is not sufficient for a problem to be privately learnable (see Section 3.3), but is 
it necessary? 

In binary classification with discrete domain (agnostic PAC Learning), since VC-dimension 
being finite characterizes the class of privately PAC learnable problems, the necessity of 
uniform convergence is clear. This could also be more explicitly seen from Beimel et al. 
(2013a) where the probabilistic representation dimension is a form of uniform convergence 
on its own. 

In the general learning setting, the problem is still open. We were not able to prove that private learnability implies uniform convergence, but we could not construct a counterexample either. All our examples in this section implicitly or explicitly use uniform convergence, which seems to hint at a positive answer.


5 Practical concerns 

5.1 High confidence private learning via boosting 

We have stated all results so far in expectation. We can easily convert these to the high-confidence learning paradigm by applying Markov’s inequality, since convergence in expectation to the minimum risk implies convergence in probability to the minimum risk. While the $1/\delta$ dependence on the failure probability $\delta$ is not ideal, we can apply a similar meta-algorithm, “boosting” (Schapire, 1990), as in Shalev-Shwartz et al. (2010, Section 7) to get a $\log(1/\delta)$ rate. The approach is similar to cross-validation. Given a pre-chosen positive integer $a$, the original boosting algorithm randomly partitions the data into $(a + 1)$ subsamples of size $n/(a + 1)$, and applies Algorithm 1 on the first $a$ partitions, obtaining $a$ candidate hypotheses. The method then returns the one hypothesis with the smallest validation error, calculated using the remaining subsample. To ensure differential privacy, our method instead uses the exponential mechanism to sample the best candidate hypothesis, where the logarithm of the sampling probability is proportional to the negative validation error.

Theorem 25 (High-confidence private learning). If an algorithm $\mathcal{A}$ privately learns a problem with rate $\xi(n)$ and privacy parameter $\epsilon(n)$, then the boosting algorithm $\mathcal{A}'$ with

a = log I is max |e f iog(3/n)+i ) > } -differentially private, its output h obeys 


R{h) -R* <ef 


n \ 
log(3/(5) l) 


^ /kiM) 

V n 


for an absolute constant $C$ with probability at least $1 - \delta$.


5.2 Efficient sampling algorithm for convex problems 

Our exponential-mechanism-based algorithm is meant to establish a more explicit geometric condition under which AERM holds; the algorithm itself may not be computationally tractable. Even ignoring the difficulty of constructing an $\epsilon$-covering set with an exponential number of elements, sampling from the set alone is not a polynomial-time operation. But we can solve a subset of the continuous version of Algorithm 1 described in Theorem 19 in polynomial time to arbitrary accuracy (see also Bassily et al. (2014, Theorem 3.4)).

Proposition 26. If $\frac{1}{n} \sum_{i=1}^{n} \ell(h, z_i) + g_n(h)$ is convex in $h$ and $\mathcal{H}$ is a convex set, then the sampling procedure in Algorithm 1 can be solved in polynomial time.

Proof. When $\frac{1}{n} \sum_{i=1}^{n} \ell(h, z_i) + g_n(h)$ is convex, the utility function $q(h, Z)$ is concave in $h$. The density to be sampled from in Algorithm 1 is proportional to $\exp\left(\frac{\epsilon(n)}{2\Delta q} q(h, Z)\right)$ and is log-concave. The Markov chain sampling algorithm in Applegate & Kannan (1991) is guaranteed to produce a sample from a distribution that is arbitrarily close to the target distribution (in the total variation sense) in polynomial time. ■
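
To make the sampling step concrete, the sketch below draws from a one-dimensional log-concave density with a plain random-walk Metropolis chain. This is a simpler stand-in and not the Applegate & Kannan (1991) algorithm, so it carries none of the polynomial-time guarantees of Proposition 26; the function names, the utility, and the tuning constants are hypothetical.

```python
import math
import random


def metropolis_sample(log_density, h0, lower, upper, steps, step_size, rng):
    """Random-walk Metropolis on the interval [lower, upper].

    log_density is the unnormalized log-density, e.g. eps / (2 * dq) * q(h, Z).
    Proposals outside the interval are rejected, which keeps the chain in H.
    """
    h, lp = h0, log_density(h0)
    for _ in range(steps):
        cand = h + rng.uniform(-step_size, step_size)
        if lower <= cand <= upper:
            lp_cand = log_density(cand)
            # accept with probability min(1, density(cand) / density(h))
            if rng.random() < math.exp(min(0.0, lp_cand - lp)):
                h, lp = cand, lp_cand
    return h


# Hypothetical concave utility on H = [0, 1]: q(h) = -(h - 0.3)^2, scaled by 50.
def log_density(h):
    return -50.0 * (h - 0.3) ** 2


rng = random.Random(0)
samples = [metropolis_sample(log_density, 0.5, 0.0, 1.0, 500, 0.2, rng)
           for _ in range(200)]
sample_mean = sum(samples) / len(samples)
```

Because the chain only approximates the target distribution after finitely many steps, the privacy guarantee of the exact exponential mechanism does not transfer automatically; this is why the proposition relies on a sampler with explicit total variation guarantees.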

5.3 Exponential mechanism in infinite domain 

As we mentioned earlier, the results in Section 4 based on the exponential mechanism implicitly assume certain regularity conditions that ensure the existence of a probability distribution.

When $\mathcal{H}$ is finite, the existence is trivial. On the other hand, an infinite set $\mathcal{H}$ is tricky in that there may not exist a proper distribution that satisfies $P(h) \propto \exp\left(\frac{\epsilon}{2\Delta q} q(Z, h)\right)$ for at least some $q(Z, h)$. For instance, if $\mathcal{H} = \mathbb{R}$ and $q(Z, h) \equiv 1$, then $\int_{\mathbb{R}} \exp\left(\frac{\epsilon}{2\Delta q} q(Z, h)\right) dh = \infty$. Such distributions, which are only defined up to scale with no finite normalization constant, are called improper distributions. In the case of a finite-dimensional non-compact set, this translates into an additional assumption on the loss function and the regularization term.

Things get even trickier when $\mathcal{H}$ is an infinite-dimensional space, such as a subset of a Hilbert space. While probability measures can still be defined, no density function can be defined on such spaces. Therefore, we cannot use the exponential mechanism to define a valid probability distribution.

The practical implication is that the exponential mechanism is really only applicable when the hypothesis space $\mathcal{H}$ allows for definitions of densities in the usual sense, or when $\mathcal{H}$ can be approximated by such a space. For example, a separable Hilbert space can be studied through finite-dimensional projections. Also, we can approximate an RKHS induced by translation-invariant kernels via random Fourier features (Rahimi & Recht, 2007).


6 Results for learnability under $(\epsilon, \delta)$-differential privacy

Another way to weaken the definition of private learnability is through $(\epsilon, \delta)$-approximate differential privacy.

Definition 27 (Dwork et al., 2006a). An algorithm $\mathcal{A}$ obeys $(\epsilon, \delta)$-differential privacy if for any $Z, Z'$ such that $d(Z, Z') \le 1$, and for any measurable set $S \subset \mathcal{H}$,

$$\mathbb{P}_{h \sim \mathcal{A}(Z)}(h \in S) \le e^{\epsilon}\, \mathbb{P}_{h \sim \mathcal{A}(Z')}(h \in S) + \delta.$$
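
For output distributions on a finite set, the inequality in this definition can be checked directly. The sketch below uses the fact that the smallest admissible $\delta$ for a given $\epsilon$ equals $\sum_h \max(0, p(h) - e^{\epsilon} p'(h))$, since the worst-case set is $S = \{h : p(h) > e^{\epsilon} p'(h)\}$. The function names and the two example distributions are hypothetical.

```python
import math


def smallest_delta(p, p_prime, eps):
    """sup over sets S of P(S) - exp(eps) * P'(S) for finite distributions.

    The supremum is attained by S = {h : p(h) > exp(eps) * p'(h)}.
    """
    return sum(max(0.0, a - math.exp(eps) * b) for a, b in zip(p, p_prime))


def is_approx_dp(p, p_prime, eps, delta, tol=1e-12):
    """Check the inequality in both directions for one pair of adjacent inputs."""
    return (smallest_delta(p, p_prime, eps) <= delta + tol
            and smallest_delta(p_prime, p, eps) <= delta + tol)


# Hypothetical output distributions of a mechanism on two adjacent data sets.
p = [0.75, 0.25]
p_prime = [0.25, 0.75]
delta_at_eps0 = smallest_delta(p, p_prime, 0.0)
pure_dp = is_approx_dp(p, p_prime, math.log(3), 0.0)
```

At $\epsilon = 0$ the smallest $\delta$ is the total variation distance between the two distributions, while at $\epsilon = \log 3$ this pair already satisfies pure differential privacy. A full verification would repeat the check over every pair of adjacent data sets.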


We define the corresponding version of the learnability problem as follows.

Definition 28 (Approximately Private Learnability). We say a learning problem is $\Delta(n)$-approximately privately learnable for some pre-specified family of rates $\Delta(n)$ if for some $\epsilon < \infty$, $\delta(n) \in \Delta(n)$, there exists a universally consistent algorithm that is $(\epsilon, \delta(n))$-DP.

This is a completely different subject to study, and the class of approximately privately learnable problems could be substantially larger than the class of pure privately learnable problems. Moreover, the picture may vary with respect to how small $\delta(n)$ is required to be. In this section, we present our preliminary investigation of this problem.

Specifically, we will consider two questions:

1. Does the existence of an $(\epsilon, \delta)$-DP always AERM algorithm characterize the class of approximately privately learnable problems?

2. Are all learnable problems approximately privately learnable for different choices of $\Delta(n)$?

The minimal requirement in the same flavor as Definition 3 would be to require $\Delta(n) = \{\delta(n) \mid \delta(n) \to 0\}$. The learnability problem turns out to be trivial under this definition due to the following observation.

Lemma 29. For any algorithm $\mathcal{A}$ that acts on $Z$, the algorithm $\mathcal{A}'$ that runs $\mathcal{A}$ on a randomly chosen subset of $Z$ of size $\sqrt{n}$ is $(0, \frac{1}{\sqrt{n}})$-DP.

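Proof. Let $Z$ and $Z'$ be adjacent data sets that differ only in data point $i$, and let $I$ denote the random index set of the subsample. For any $i$ and any $S \in \sigma(\mathcal{H})$,

$$\begin{aligned}
\mathbb{P}(\mathcal{A}'(Z) \in S) &= \mathbb{P}_I(\mathcal{A}(Z_I) \in S \mid i \in I)\, \mathbb{P}(i \in I) + \mathbb{P}_I(\mathcal{A}(Z_I) \in S \mid i \notin I)\, \mathbb{P}(i \notin I) \\
&= \mathbb{P}_I(\mathcal{A}(Z_I) \in S \mid i \in I)\, \mathbb{P}(i \in I) + \mathbb{P}_I(\mathcal{A}(Z'_I) \in S \mid i \notin I)\, \mathbb{P}(i \notin I) \\
&= \mathbb{P}(\mathcal{A}'(Z') \in S) + \left[ \mathbb{P}_I(\mathcal{A}(Z_I) \in S \mid i \in I) - \mathbb{P}_I(\mathcal{A}(Z'_I) \in S \mid i \in I) \right] \mathbb{P}(i \in I) \\
&\le \mathbb{P}(\mathcal{A}'(Z') \in S) + \mathbb{P}(i \in I) \\
&= e^{0}\, \mathbb{P}(\mathcal{A}'(Z') \in S) + \frac{1}{\sqrt{n}}.
\end{aligned}$$

This verifies the $(0, 1/\sqrt{n})$-DP of algorithm $\mathcal{A}'$. ■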
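
The only quantity that enters this bound is the probability that the differing data point is included in the subsample. The following sketch (hypothetical function names; the subsample size is taken to be $\lfloor \sqrt{n} \rfloor$) wraps an arbitrary algorithm with the subsampling step and computes that inclusion probability exactly.

```python
import math
import random


def subsample_then_run(algorithm, Z, rng):
    """A'(Z): run `algorithm` on a uniformly random subset of floor(sqrt(n)) points."""
    m = math.isqrt(len(Z))
    return algorithm(rng.sample(Z, m))


def inclusion_probability(n):
    """P(i in I) for a uniformly random index set I of size m = floor(sqrt(n)), n >= 1."""
    m = math.isqrt(n)
    # subsets of size m containing a fixed index, over all subsets of size m
    return math.comb(n - 1, m - 1) / math.comb(n, m)


n = 10_000
delta = inclusion_probability(n)          # equals m / n = 1 / sqrt(n) here
output = subsample_then_run(len, list(range(n)), random.Random(0))
```

Here `len` stands in for an arbitrary, possibly non-private algorithm; the wrapper never inspects what the algorithm does, which is why the resulting $(0, 1/\sqrt{n})$ guarantee is so weak.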

The above lemma suggests that if $\delta(n) = o(1)$ is all we need for approximately private learnability, then any consistent learning algorithm can be made approximately DP by simply subsampling. In other words, any learnable problem is also learnable under approximate differential privacy.

To get around this triviality, we need to specify a sufficiently fast rate at which $\delta(n)$ goes to 0. While it is common to require that $\delta(n) = o(1/\mathrm{poly}(n))$¹ for cryptographically strong privacy protection, requiring $\delta(n) = o(1/n)$ is already enough to invalidate the above subsampling argument and make the problem of learnability a non-trivial one.

¹ Here the notation “$o(1/\mathrm{poly}(n))$” means “decays faster than any polynomial of $n$”. A sequence $a(n) = o(1/\mathrm{poly}(n))$ if and only if $a(n) = o(n^{-r})$ for any $r > 0$.

Again, the question is whether AERM characterizes approximately private learnability and whether there is a gap between the classes of learnable and approximately privately learnable problems.

Here we show that the “folklore” Lemma 8 and the subsampling lemma (Lemma 34) can be extended to work with $(\epsilon, \delta)$-DP, and then we provide a positive answer to the first question.

Lemma 30 (Stability of $(\epsilon, \delta)$-DP). If $\mathcal{A}$ is $(\epsilon, \delta)$-DP, and $0 \le \ell(h, z) \le 1$, then $\mathcal{A}$ is $(e^{\epsilon} - 1 + \delta)$-strongly uniform RO-stable.

Proof. For any $Z, Z'$ such that $d(Z, Z') \le 1$ and for any $z \in \mathcal{Z}$, let $p$ and $p'$ denote the densities of $\mathcal{A}(Z)$ and $\mathcal{A}(Z')$, and let the event $E = \{h \mid p(h) \ge p'(h)\}$. Then

$$\begin{aligned}
\left| \mathbb{E}_{h \sim \mathcal{A}(Z)} \ell(h, z) - \mathbb{E}_{h \sim \mathcal{A}(Z')} \ell(h, z) \right| &= \left| \int_h \ell(h, z) p(h)\, dh - \int_h \ell(h, z) p'(h)\, dh \right| \\
&\le \sup_{h, z} \ell(h, z) \int_E \left( p(h) - p'(h) \right) dh \le \int_E \left( p(h) - p'(h) \right) dh \\
&= \mathbb{P}_{h \sim \mathcal{A}(Z)}(h \in E) - \mathbb{P}_{h \sim \mathcal{A}(Z')}(h \in E) \\
&\le (e^{\epsilon} - 1)\, \mathbb{P}_{h \sim \mathcal{A}(Z')}(h \in E) + \delta \le e^{\epsilon} - 1 + \delta.
\end{aligned}$$

The last line applies the definition of $(\epsilon, \delta)$-DP. ■

Lemma 31 (Subsampling Lemma of (e, (5)-DP). If A is {e,5)-DP, then A' that acts on 
a random subsample of Z of size yn obeys {e' A')-DP with e' = log(l + 'ye^{e^ — 1)) and 
S' = 76^5. 

Proof For any event E £ let i be the coordinate where Z and Z' differs 

GE)= ^Fhr..A{Zi){h ~ E\i G I) + (1 - ~ E\i i I) 

=l^hr.^A(Zj){h ~ E\i G /) + (1 - l)Fhr..A(z'j){h ~ E\i ^ I) 

=7lF’ft~A(zp(^ ~ E\i G I)- 7lF’h~A(z;)(^ ~ E\i £ /) + 7lF’/i~A(z;)(^ ~ E\i £ I) 

+ (1 - lWh^A(z'j){h ~ E\i ^ I) 

=IF’/i~A'(Z') {h £ E) + 7[IPh~A(Z/) {hr^ E\i £ I) - Fhr..A{z'j) (h ~ i?|z G /)] 

ZFhr..A'(Z'){d £ E) + ~ E\i G /) + jS, (12) 

where in last line, we apply (e, 5)-DP of A. 

It remains to show that P/irv.A(z')(^ ~ E\i £ I) is similar to P/ir^A'(Z')(^ ^ E). First, 

^h'^A'{Z')ih G E) = lFi^^A(z'j)i^ ^ E\i G /) + (1 — 7)IF’/i~A(Z})(^ ^ E\i ^ I). (13) 

Denote Xi = {I\i G /}, X 2 = {I\i i !}■ We known |Xi| = and IX 2 I = and 

IX 1 I/IX 2 I = ^nl{n — 7 n). For every / G X 2 there are precisely yn elements J G X\ such 
that d(/, J) = 1. Likewise, for every J G Xi, there are n — ^n elements I G I 2 such that 
d{I,J) = 1. It follows by symmetry that if we apply (e,(I)-DP to l/yn of each I £ I 2 
and change I to their corresponding J G Xi, then each J G Xi will receive (n — jn) /yn 
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’contribution” in total from the sum over all I G X 2 . 


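\[
\begin{aligned}
\mathbb{P}_{h\sim\mathcal{A}(Z'_I)}(h \in E \mid i \notin I)
&= \frac{1}{|\mathcal{I}_2|}\sum_{I \in \mathcal{I}_2}\mathbb{P}_{h\sim\mathcal{A}(Z'_I)}(h \in E) \\
&= \frac{1}{|\mathcal{I}_2|}\sum_{I \in \mathcal{I}_2}\sum_{k=1}^{\gamma n}\frac{1}{\gamma n}\,\mathbb{P}_{h\sim\mathcal{A}(Z'_I)}(h \in E) \\
&\ge \frac{1}{|\mathcal{I}_2|}\sum_{J \in \mathcal{I}_1}\frac{n - \gamma n}{\gamma n}\,e^{-\epsilon}\big(\mathbb{P}_{h\sim\mathcal{A}(Z'_J)}(h \in E) - \delta\big) \\
&= \frac{1}{|\mathcal{I}_1|}\sum_{J \in \mathcal{I}_1} e^{-\epsilon}\big(\mathbb{P}_{h\sim\mathcal{A}(Z'_J)}(h \in E) - \delta\big) \\
&= e^{-\epsilon}\,\mathbb{P}_{h\sim\mathcal{A}(Z'_I)}(h \in E \mid i \in I) - e^{-\epsilon}\delta.
\end{aligned}
\]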

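Substituting into (13), we get

\[
\mathbb{P}_{h\sim\mathcal{A}(Z'_I)}(h \in E \mid i \in I) \le \frac{1}{\gamma + (1-\gamma)e^{-\epsilon}}\,\mathbb{P}_{h\sim\mathcal{A}'(Z')}(h \in E) + \frac{(1-\gamma)e^{-\epsilon}}{\gamma + (1-\gamma)e^{-\epsilon}}\,\delta.
\]

We further relax the upper bound to the simple form $e^{\epsilon}\,\mathbb{P}_{h\sim\mathcal{A}'(Z')}(h \in E) + \delta$ and substitute it
into (12), which gives

\[
\mathbb{P}_{h\sim\mathcal{A}'(Z)}(h \in E) \le \big(1 + \gamma e^{\epsilon}(e^{\epsilon} - 1)\big)\,\mathbb{P}_{h\sim\mathcal{A}'(Z')}(h \in E) + \gamma\delta + \gamma(e^{\epsilon} - 1)\delta,
\]

which concludes the proof. ■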
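The parameters in Lemma 31 are easy to evaluate numerically. The sketch below computes $\epsilon' = \log(1 + \gamma e^{\epsilon}(e^{\epsilon} - 1))$ and $\delta' = \gamma e^{\epsilon}\delta$ for a given sampling fraction; the function name `subsampled_privacy` is an illustrative choice and does not come from the paper.

```python
import math


def subsampled_privacy(eps: float, delta: float, gamma: float):
    """Privacy parameters (eps', delta') of Lemma 31 for a gamma-fraction subsample."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    eps_prime = math.log(1.0 + gamma * math.exp(eps) * (math.exp(eps) - 1.0))
    delta_prime = gamma * math.exp(eps) * delta
    return eps_prime, delta_prime


if __name__ == "__main__":
    n = 10_000
    gamma = 1.0 / math.sqrt(n)  # the choice used in the proof of Theorem 32
    print(subsampled_privacy(eps=1.0, delta=1e-6, gamma=gamma))
```

With $\gamma = 1/\sqrt{n}$, both parameters shrink as $n$ grows, which is the effect exploited in the proof of Theorem 32. For $\gamma$ close to 1 the formula for $\epsilon'$ can exceed $\epsilon$, so the bound is only useful for small sampling fractions.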


Using the above two lemmas, we are able to establish the same result, which says that
AERM characterizes the approximate private learnability for certain classes of $\Delta(n)$.

Theorem 32. If a problem is $\Delta(n)$-approximately privately learnable, then there
exists an always AERM algorithm that is $(\epsilon(n), n^{-1/2}e^{\epsilon}\delta(\sqrt{n}))$-DP for some $\epsilon(n) \to 0$ and
$\delta(\sqrt{n}) \in \Delta(n)$. The converse is also true if $n^{-1/2}e^{\epsilon}\delta(\sqrt{n}) \in \Delta(n)$.

Proof. Suppose we have an always AERM algorithm with rate $\xi_{\mathrm{erm}}(n)$ that is $(\epsilon(n), \delta(n))$-DP for
$\delta(n) \in \Delta(n)$. Then by Lemma 30, this algorithm is strongly uniform RO-stable with
rate $e^{\epsilon(n)} - 1 + \delta(n)$. By Theorem 35, the algorithm is universally consistent with rate
$\xi_{\mathrm{erm}}(n) + e^{\epsilon(n)} - 1 + \delta(n)$. This establishes the “if” part.

To see the “only if” part, suppose a problem is $\Delta(n)$-approximately privately learnable
with $\epsilon$ and $\delta(n) \in \Delta(n)$. Then by Lemma 31 with $\gamma = 1/\sqrt{n}$, we get an algorithm that
obeys the privacy condition. It remains to prove always AERM, which requires exactly the
same arguments as in the proof of Lemma 10. Details are omitted. ■

Note that the results above suggest that in the two canonical settings $\Delta(n) = o(1/n)$ or
$\Delta(n) = o(1/\mathrm{poly}(n))$, the existence of a private AERM algorithm that satisfies the stronger
constraint $\epsilon(n) = o(1)$ characterizes the learnability.

The next question, whether all learnable problems are also approximately privately
learnable, depends on how fast $\delta(n)$ is required to decay. We know that when we only
have $\Delta(n) = o(1)$, all learnable problems are approximately privately learnable, and when
we have $\Delta(n) = \{0\}$, only a strict subset of these problems is privately learnable. The
following result establishes that when $\delta(n)$ needs to go to 0 at a sufficiently fast rate,
there is a separation between learnability and approximately private learnability.
Proposition 33. Let $\Delta(n) = \{\delta(n) \mid \delta(n) \le \tilde{\delta}(n)\}$ for some sequence $\tilde{\delta}(n) \to 0$. The
following statements are true.

• All learnable problems are $\Delta(n)$-approximately privately learnable if $\tilde{\delta}(n) = \omega(1/n)$.

• There exists a problem that is learnable but not A{n)-approximately privately learnable, 
ifS{n) < 

Proof. The first claim follows from the same argument as in Lemma 29. If a problem is
learnable, there exists a universally consistent learning algorithm $\mathcal{A}$. The algorithm that
applies $\mathcal{A}$ on a $\tilde{\delta}(n)$-fraction random subsample of the dataset is $(0, \tilde{\delta}(n))$-DP and universally
consistent with rate $\xi(n\tilde{\delta}(n))$. Since $\tilde{\delta}(n) = \omega(1/n)$, $n\tilde{\delta}(n) \to \infty$.

We now show that when we require a fast decaying $\delta(n)$, the example in
Section 3.3 due to Chaudhuri & Hsu (2011) is no longer approximately privately learnable,
even under $(\epsilon,\delta)$-DP. Let $Z, Z'$ be two completely different data sets of size $n$. By repeatedly applying
the definition of $(\epsilon,\delta)$-DP, for any set $S \subseteq \mathcal{H}$,

\[
\mathbb{P}(\mathcal{A}(Z) \in S) \le e^{n\epsilon}\,\mathbb{P}(\mathcal{A}(Z') \in S) + \sum_{i=1}^{n} e^{(i-1)\epsilon}\delta \le e^{n\epsilon}\,\mathbb{P}(\mathcal{A}(Z') \in S) + n e^{(n-1)\epsilon}\delta.
\]

When we shift the inequality around, we get

\[
\mathbb{P}(\mathcal{A}(Z') \in S) \ge e^{-\epsilon n}\,\mathbb{P}(\mathcal{A}(Z) \in S) - e^{-\epsilon} n\delta.
\]
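The repeated application of the $(\epsilon,\delta)$-DP definition can be traced numerically. The sketch below accumulates the additive term $\sum_{i=1}^{k} e^{(i-1)\epsilon}\delta$ over $k$ replacements and evaluates the relaxed lower bound obtained after rearranging; the function names are illustrative and are not part of the paper.

```python
import math


def group_privacy(eps: float, delta: float, k: int):
    """Parameters after applying (eps, delta)-DP along a chain of k replacements.

    Returns (k * eps, delta * sum_{i=1..k} e^{(i-1) eps}).
    """
    eps_k = k * eps
    delta_k = delta * sum(math.exp((i - 1) * eps) for i in range(1, k + 1))
    return eps_k, delta_k


def relaxed_lower_bound(prob_on_z: float, eps: float, delta: float, k: int) -> float:
    """Lower bound e^{-k eps} P(A(Z) in S) - e^{-eps} k delta on P(A(Z') in S)."""
    return math.exp(-k * eps) * prob_on_z - math.exp(-eps) * k * delta


if __name__ == "__main__":
    eps, delta, k = 0.1, 1e-6, 50
    print(group_privacy(eps, delta, k))
    print(relaxed_lower_bound(0.9, eps, delta, k))
```

The relaxed bound replaces the exact sum by $k e^{(k-1)\epsilon}\delta$, so it is never larger than the bound obtained from the exact sum.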

Consider the same example in Section 3.3 where we hope to learn a threshold on [0,1]. 
Assuming there exists an algorithm A that is universally AERM and (e(n), 5(n))-DP for 
e(n) < oo and 5{n) < 0.4ne“'^"'. 

Everything up to (4) remains exactly the same. Now, apply the above implication of 
(e, (5)-DP, we can replace (4) for each i = 2, ...,K, by 

F{A{Zi) E [hi - r]/3, hi + 77 / 3 ]) > exp(-en)P(A(Zi) E [hi - rj/S, hi + p/3]) - n6{n). 

Then (5) becomes 

F{A{Zi) ^ [hi — p/3, hi + p/3]) > K exp(—en)0.9 — Ke~'^n6{n) > 0.9 > 0.5, 
Figure 4: Illustration of Proposition 33 and the open problem.


where the last inequality follows by iF > exp(en) and d(n) < 0.4ne“’^"’. This yields the same 
contradiction to always AERM of ^ on Zi, which requires ¥{A{Zi) ^ [hi — r]/3, hi + r]/3]) < 
0.1. Therefore, such AERM does not exist. By the contrapositive of Theorem 32, the 
problem is not approximately privately learnable for 6{n) < )”■ ) ^ _ 


The bound can be further improved to exp(—e(n)n)/n if we directly work with universal 
consistency on various distributions rather than through always AERM on specific data 
points. Even that is likely to be suboptimal as there might be more challenging problems 
and less favorable packings to consider. 

The point of this exposition, however, is to illustrate that $(\epsilon,\delta)$-DP alone does not close the
gap between learnability and private learnability. Additional relaxation on the specified rate
of decay of $\delta$ does. We now know that the phase transition occurs when $\delta(n)$ is somewhere
between n(exp(—logn)) and 0(l/n); but there is still a substantial gap between the 
upper and lower bounds. 


7 Conclusion and future work 

In this paper, we revisited the question “What can we learn privately?” and considered
a broader class of statistical machine learning problems than those studied previously.
Specifically, we characterized learnability under the privacy constraint by showing that any
privately learnable problem can be learned by a private algorithm that asymptotically
minimizes the empirical risk for any data, and that the problem is not privately learnable
otherwise. This allows us to construct a conceptual procedure that privately learns any
privately learnable problem. We also proposed a relaxed notion of private learnability called
private $\mathcal{D}$-learnability, which requires the existence of an algorithm that is consistent for any
distribution within a class of distributions $\mathcal{D}$. We characterized private $\mathcal{D}$-learnability
too with a weaker notion of AERM. For problems that can be formulated as penalized
empirical risk minimization, we provided a sampling algorithm with a set of meaningful
sufficient conditions on the geometry of the hypothesis space and demonstrated that it covers a
large class of problems. In addition, we extended the characterization to learnability
under $(\epsilon,\delta)$-differential privacy and provided a preliminary analysis which establishes the
existence of a phase transition from all learnable problems being approximately privately
learnable to some learnable problems not being approximately privately learnable at some
non-trivial rate of decay of $\delta(n)$.

Future work includes understanding the conditions under which privacy and AERM are
contradictory (recall that we only have one example, on learning threshold functions, due
to Chaudhuri & Hsu 2011), characterizing the rate of convergence, searching for practical
algorithms that generically learn all privately learnable problems, and better understanding
the gap between learnability and approximate private learnability.
A Proofs of technical results

In this appendix, we provide detailed proofs of the technical results stated in the main
text.


A.1 Privacy in subsampling

Proof of Lemma 4. Let $\mathcal{A}$ be the consistent $\epsilon$-DP algorithm. Consider $\mathcal{A}'$ that applies $\mathcal{A}$ to
a random subsample of $\lfloor\sqrt{n}\rfloor$ data points. By Lemma 34 with subsample size $\gamma n = \lfloor\sqrt{n}\rfloor$, we get the
privacy claim. For the consistency claim, note that the subsample is an iid sample of size
$\lfloor\sqrt{n}\rfloor$ from the original distribution. ■

Lemma 34 (Subsampling theorem). If Algorithm $\mathcal{A}$ is $\epsilon$-DP for $Z \in \mathcal{Z}^n$ for any $n =
1, 2, 3, \ldots$, then the algorithm $\mathcal{A}'$ that outputs the result of $\mathcal{A}$ on a random subsample of size
yn data points preserves — e~’^)-DP. 
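Proof of Lemma 34 (Subsampling theorem). This is a corollary of Lemma 4.4 in Beimel
et al. (2014). To be self-contained, we reproduce the proof here in our notation.

Recall that $\mathcal{A}'$ is the algorithm that first randomly subsamples $\gamma n$ data points and then applies
$\mathcal{A}$. Let $Z$ and $Z'$ be any neighboring databases and assume they differ on the $i$th data
point. Let $S \subseteq [n]$ be the indices of the random subset of the entries that are selected, and let
$\mathcal{R} \subseteq [n]\setminus\{i\}$ be an index set of size $\gamma n - 1$. We apply the law of total expectation twice and
argue that for any adjacent $Z, Z'$ and any event $E \subseteq \mathcal{H}$,

\[
\begin{aligned}
\frac{\mathbb{P}_{h\sim\mathcal{A}'(Z)}(h \in E)}{\mathbb{P}_{h\sim\mathcal{A}'(Z')}(h \in E)}
&= \frac{\gamma\,\mathbb{P}_{h\sim\mathcal{A}(Z_S)}(h \in E \mid i \in S) + (1-\gamma)\,\mathbb{P}_{h\sim\mathcal{A}(Z_S)}(h \in E \mid i \notin S)}
        {\gamma\,\mathbb{P}_{h\sim\mathcal{A}(Z'_S)}(h \in E \mid i \in S) + (1-\gamma)\,\mathbb{P}_{h\sim\mathcal{A}(Z'_S)}(h \in E \mid i \notin S)} \\
&= \frac{\sum_{\mathcal{R}\subset[n]\setminus\{i\}}\mathbb{P}(\mathcal{R})\big[\gamma\,\mathbb{P}_{h\sim\mathcal{A}(Z_S)}(h \in E \mid S = \mathcal{R}\cup\{i\}) + (1-\gamma)\,\mathbb{P}_{h\sim\mathcal{A}(Z_S)}(h \in E \mid S = \mathcal{R}\cup\{j\}, j \ne i)\big]}
        {\sum_{\mathcal{R}\subset[n]\setminus\{i\}}\mathbb{P}(\mathcal{R})\big[\gamma\,\mathbb{P}_{h\sim\mathcal{A}(Z'_S)}(h \in E \mid S = \mathcal{R}\cup\{i\}) + (1-\gamma)\,\mathbb{P}_{h\sim\mathcal{A}(Z'_S)}(h \in E \mid S = \mathcal{R}\cup\{j\}, j \ne i)\big]}.
\end{aligned}
\]

By the given condition that $\mathcal{A}$ is $\epsilon$-DP, we can replace $\mathcal{R}\cup\{i\}$ with $\mathcal{R}\cup\{j\}$ for an arbitrary
$j$ with bounded changes in the probability, and the above likelihood ratio can be upper
bounded by

\[
\frac{(\gamma e^{\epsilon} + 1 - \gamma)\sum_{\mathcal{R}\subset[n]\setminus\{i\}}\mathbb{P}(\mathcal{R})\,\mathbb{P}_{h\sim\mathcal{A}(Z_S)}(h \in E \mid S = \mathcal{R}\cup\{j\})}
     {(\gamma e^{-\epsilon} + 1 - \gamma)\sum_{\mathcal{R}\subset[n]\setminus\{i\}}\mathbb{P}(\mathcal{R})\,\mathbb{P}_{h\sim\mathcal{A}(Z'_S)}(h \in E \mid S = \mathcal{R}\cup\{j\})}
= \frac{\gamma e^{\epsilon} + 1 - \gamma}{\gamma e^{-\epsilon} + 1 - \gamma}
= \frac{1 + \gamma(e^{\epsilon} - 1)}{1 + \gamma(e^{-\epsilon} - 1)}.
\]

By definition, the privacy loss of the algorithm $\mathcal{A}'$ is therefore

\[
\epsilon' \le \log\big(1 + \gamma[e^{\epsilon} - 1]\big) - \log\big(1 + \gamma[e^{-\epsilon} - 1]\big).
\]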


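Note that $\epsilon > 0$ implies that $-1 < e^{-\epsilon} - 1 < 0$ and $0 < e^{\epsilon} - 1 < \infty$. The result follows by
applying the following properties of the natural logarithm to upper bound the expression:

\[
\begin{aligned}
\log(1 + x) &\le \frac{x}{2}\cdot\frac{2 + x}{1 + x} \le x && \text{for } 0 \le x < \infty, \\
\log(1 + x) &\ge \frac{x}{2}\cdot\frac{2 + x}{1 + x} \ge \frac{x}{1 + x} && \text{for } -1 < x \le 0.
\end{aligned}
\]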
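The bound on $\epsilon'$ and the two logarithm inequalities used above can be checked numerically. The following is a minimal sketch; the helper names `subsample_eps` and `log_bounds_hold` are illustrative and do not appear in the paper.

```python
import math


def subsample_eps(eps: float, gamma: float) -> float:
    """Bound from the proof: log(1 + gamma(e^eps - 1)) - log(1 + gamma(e^-eps - 1))."""
    return math.log1p(gamma * math.expm1(eps)) - math.log1p(gamma * math.expm1(-eps))


def log_bounds_hold(x: float, tol: float = 1e-12) -> bool:
    """Check the two-sided logarithm inequalities at a point x > -1."""
    mid = 0.5 * x * (2.0 + x) / (1.0 + x)
    if x >= 0.0:
        # log(1 + x) <= (x / 2)(2 + x) / (1 + x) <= x
        return math.log1p(x) <= mid + tol and mid <= x + tol
    # log(1 + x) >= (x / 2)(2 + x) / (1 + x) >= x / (1 + x)
    return math.log1p(x) >= mid - tol and mid >= x / (1.0 + x) - tol


if __name__ == "__main__":
    for gamma in (1.0, 0.1, 0.01):
        print(gamma, subsample_eps(1.0, gamma))
```

With $\gamma = 1$ (no subsampling) the expression evaluates to $2\epsilon$, and it decreases toward 0 as $\gamma$ shrinks.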


A.2 Characterization of private learnability 

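Privacy implies stability. Lemma 8 says that an $\epsilon$-differentially private algorithm is
$(e^{\epsilon} - 1)$-stable (and also $2\epsilon$-stable if $\epsilon < 1$).

Proof of Lemma 8. Construct $Z'$ by replacing an arbitrary data point in $Z$ with $z'$, and let
the probability density/mass defined by $\mathcal{A}(Z)$ and $\mathcal{A}(Z')$ be $p(h)$ and $p'(h)$ respectively.
Then we can bound the stability as follows:

\[
\begin{aligned}
\left|\mathbb{E}_{h\sim\mathcal{A}(Z)}\ell(h,z) - \mathbb{E}_{h\sim\mathcal{A}(Z')}\ell(h,z)\right|
&= \left|\int_h \ell(h,z)\,p(h)\,dh - \int_h \ell(h,z)\,p'(h)\,dh\right| \\
&= \left|\int_h \ell(h,z)\big(p(h) - p'(h)\big)\,dh\right| \\
&\le \sup_{h,z}|\ell(h,z)|\int_{p(h) \ge p'(h)} \big(p(h) - p'(h)\big)\,dh
 \le 1\cdot\int_{p(h) \ge p'(h)} p'(h)(e^{\epsilon} - 1)\,dh \\
&= (e^{\epsilon} - 1)\int_{p(h) \ge p'(h)} p'(h)\,dh \le e^{\epsilon} - 1.
\end{aligned}
\]

For $\epsilon < 1$ we have $\exp(\epsilon) - 1 < 2\epsilon$.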


Stability + AERM $\Rightarrow$ consistency

Theorem 35 (Randomized version of Shalev-Shwartz et al. 2010, Theorem 8).
If any algorithm is $\xi_1(n)$-stable and $\xi_2(n)$-AERM, then it is consistent with rate $\xi(n) =
\xi_1(n) + \xi_2(n)$.

Proof. We will show the following two steps as in Shalev-Shwartz et al. (2010):

1. Uniform RO stability $\Rightarrow$ On-average stability $\Rightarrow$ On-average generalization

2. AERM + On-average generalization $\Rightarrow$ consistency

The definition of these quantities is self-explanatory.

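To show that “stability implies generalization”, we have

\[
\begin{aligned}
&\left|\mathbb{E}_{Z\sim\mathcal{D}^n}\Big(\mathbb{E}_{h\sim\mathcal{A}(Z)}R(h) - \mathbb{E}_{h\sim\mathcal{A}(Z)}\frac{1}{n}\sum_{i=1}^{n}\ell(h,z_i)\Big)\right| \\
&\quad= \left|\mathbb{E}_{Z\sim\mathcal{D}^n,\,\{z_1',\ldots,z_n'\}\sim\mathcal{D}^n}\Big(\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h\sim\mathcal{A}(Z)}\ell(h,z_i') - \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h\sim\mathcal{A}(Z^{(i)})}\ell(h,z_i')\Big)\right| \\
&\quad\le \sup_{Z,Z^{(i)}\in\mathcal{Z}^n,\;d(Z,Z^{(i)})=1,\;z'\in\mathcal{Z}}\left|\mathbb{E}_{h\sim\mathcal{A}(Z)}\ell(h,z') - \mathbb{E}_{h\sim\mathcal{A}(Z^{(i)})}\ell(h,z')\right| \le \xi_1(n),
\end{aligned}
\]

where $Z^{(i)}$ is obtained by replacing the $i$th entry of $Z$ with $z_i'$. Next, we show that
“generalization and AERM imply consistency”. Let $h^* \in \arg\inf_{h\in\mathcal{H}} R(h)$. By definition,
we have $\mathbb{E}_{Z\sim\mathcal{D}^n}\hat{R}(h^*, Z) = R^*$. It follows that

\[
\begin{aligned}
\mathbb{E}_{Z\sim\mathcal{D}^n}\big[\mathbb{E}_{h\sim\mathcal{A}(Z)}R(h) - R^*\big]
&= \mathbb{E}_{Z\sim\mathcal{D}^n}\big[\mathbb{E}_{h\sim\mathcal{A}(Z)}R(h) - \hat{R}(h^*, Z)\big] \\
&= \mathbb{E}_{Z\sim\mathcal{D}^n}\big[\mathbb{E}_{h\sim\mathcal{A}(Z)}R(h) - \mathbb{E}_{h\sim\mathcal{A}(Z)}\hat{R}(h, Z)\big] + \mathbb{E}_{Z\sim\mathcal{D}^n}\big[\mathbb{E}_{h\sim\mathcal{A}(Z)}\hat{R}(h, Z) - \hat{R}(h^*, Z)\big] \\
&\le \mathbb{E}_{Z\sim\mathcal{D}^n}\big[\mathbb{E}_{h\sim\mathcal{A}(Z)}R(h) - \mathbb{E}_{h\sim\mathcal{A}(Z)}\hat{R}(h, Z)\big] + \mathbb{E}_{Z\sim\mathcal{D}^n}\big[\mathbb{E}_{h\sim\mathcal{A}(Z)}\hat{R}(h, Z) - \hat{R}^*(Z)\big] \\
&\le \xi_1(n) + \xi_2(n).
\end{aligned}
\]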


Privacy + AERM $\Rightarrow$ consistency

Proof of Corollary 9. It follows by combining Lemma 8 and Theorem 35. ■


Necessity 


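Proof of Lemma 10. We construct an algorithm $\mathcal{A}'$ by subsampling a random subset of
$\lfloor\sqrt{n}\rfloor$ data points and then running $\mathcal{A}$. The privacy claim follows from Lemma 34
directly.

To prove the “always AERM” claim, we adapt the proof of Lemma 24 in Shalev-Shwartz
et al. (2010). For any fixed data set $Z \in \mathcal{Z}^n$,

\[
\begin{aligned}
\hat{R}(\mathcal{A}'(Z), Z) - \hat{R}^*(Z)
&= \mathbb{E}_{Z'\subset Z,\,|Z'| = \lfloor\sqrt{n}\rfloor}\big[\hat{R}(\mathcal{A}(Z'), Z) - \hat{R}^*(Z)\big] \\
&= \mathbb{E}_{Z'\sim\mathrm{Unif}(Z)^{\lfloor\sqrt{n}\rfloor}}\big[\hat{R}(\mathcal{A}(Z'), Z) - \hat{R}^*(Z) \,\big|\, \text{no duplicates}\big] \\
&\le \frac{\mathbb{E}_{Z'\sim\mathrm{Unif}(Z)^{\lfloor\sqrt{n}\rfloor}}\big[\hat{R}(\mathcal{A}(Z'), Z) - \hat{R}^*(Z)\big]}{\mathbb{P}(\text{no duplicates})},
\end{aligned}
\]

where $\mathrm{Unif}(Z)$ is the uniform distribution defined on the $n$ points in $Z$. We need to
condition on the event that there are no duplicates for the second equality to hold because
$Z'$ is a subsample taken without replacement. The last inequality is by the law of total
expectation and the non-negativity of the conditional expectation. But
$\mathbb{P}(\text{no duplicates}) = \prod_{i=1}^{\lfloor\sqrt{n}\rfloor - 1}(1 - i/n) \ge 1 - \sum_{i=1}^{\lfloor\sqrt{n}\rfloor - 1} i/n \ge 1/2$.
By universal consistency, $\mathcal{A}$ is consistent on
the discrete uniform distribution defined on $Z$, so

\[
\hat{R}(\mathcal{A}'(Z), Z) - \hat{R}^*(Z) \le 2\,\mathbb{E}_{Z'\sim\mathrm{Unif}(Z)^{\lfloor\sqrt{n}\rfloor}}\big[\hat{R}(\mathcal{A}(Z'), Z) - \hat{R}^*(Z)\big] \le 2\xi(\sqrt{n}).
\]

It is obvious that $\mathcal{A}'$ is consistent with rate $\xi(\sqrt{n})$, as it applies $\mathcal{A}$ on a random sample of size
$\sqrt{n}$. By Lemma 4, $\mathcal{A}'$ is differentially private with the privacy parameter stated there. By Corollary 9, the new
algorithm is universally consistent. ■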
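The lower bound $\mathbb{P}(\text{no duplicates}) \ge 1/2$ used in the proof of Lemma 10 can be verified directly. The sketch below computes the probability that $\lfloor\sqrt{n}\rfloor$ draws with replacement from $n$ points are all distinct; the function name is an illustrative choice.

```python
import math


def prob_no_duplicates(n: int) -> float:
    """Probability that floor(sqrt(n)) uniform draws with replacement from n points are distinct."""
    m = math.isqrt(n)
    prob = 1.0
    for i in range(1, m):
        # The (i + 1)-th draw must avoid the i points already drawn.
        prob *= 1.0 - i / n
    return prob


if __name__ == "__main__":
    for n in (4, 100, 10_000, 1_000_000):
        print(n, prob_no_duplicates(n))
```

The union bound gives $1 - \lfloor\sqrt{n}\rfloor(\lfloor\sqrt{n}\rfloor - 1)/(2n) > 1/2$, so the product can never drop below one half.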


A.3 Proofs for Section 3.3 


Proof of Proposition 11. If $\mathcal{A}(Z)$ is a continuous distribution, we can pick $h \in \mathcal{H}$ at any
point where $\mathcal{A}(Z)$ has finite density and set $\mathcal{A}'(Z) \mid z \in Z$ to be $h$ with probability $1/n$ and
the same as $\mathcal{A}(Z)$ with probability $1 - 1/n$. This breaks privacy because, comparing
two databases with and without $z$, the probability ratio of outputting $h$ is $\infty$.

If $\mathcal{A}(Z)$ is a discrete distribution or a mixed distribution, it must have the same support of
the point mass for all $Z$. Otherwise it violates DP because we need
$\frac{\mathbb{P}_{h\sim\mathcal{A}(Z)}(h)}{\mathbb{P}_{h\sim\mathcal{A}(Z')}(h)} \le \exp(n\epsilon)$
for any $Z, Z' \in \mathcal{Z}^n$. Specifically, let the discrete set of point mass be $\tilde{\mathcal{H}}$. If $\mathcal{H}\setminus\tilde{\mathcal{H}} \ne \emptyset$, then
we can use the same technique as in the continuous case by adding a small probability $1/n$
on $\mathcal{H}\setminus\tilde{\mathcal{H}}$ when $z \in Z$.

If $\tilde{\mathcal{H}} = \mathcal{H}$, then $\mathcal{H}$ is a discrete set. If $|\mathcal{H}| < n$, then by boundedness and Hoeffding's inequality, ERM is
a deterministic algorithm that learns any learnable problem. On the other hand, if $|\mathcal{H}| > n$,
then by the pigeonhole principle, there always exists a hypothesis $h$ that has probability smaller
than $1/n$ in $\mathcal{A}(Z)$ for any $Z \in \mathcal{Z}^n$, and we can construct $\mathcal{A}'$ by outputting a sample of $\mathcal{A}(Z)$
if $z$ is not observed and outputting a sample of $\mathcal{A}(Z) \mid \mathcal{A}(Z) \ne h$ whenever $z$ is observed.

The consistency of $\mathcal{A}'$ follows easily, as its risk is at most $1/n$ larger than that of $\mathcal{A}$. ■


A.4 Proofs for characterization of private D-learnability 

Proof of Lemma If. Let A! be the algorithm that applies A to a random subsample of size 
$\lfloor\sqrt{n}\rfloor$. If we can show that, for any $D \in \mathcal{D}$,

(a) the empirical risk of $\mathcal{A}'$ converges to the optimal population risk $R^*$ in expectation;

(b) the empirical risk of the ERM learning rule also converges to $R^*$ in expectation,

then by the triangle inequality, the empirical risk of $\mathcal{A}'$ must also converge to the empirical risk
of ERM, i.e., $\mathcal{A}'$ is $\mathcal{D}$-universal AERM.
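We will start with (a). For any distribution $D \in \mathcal{D}$, we have

\[
\begin{aligned}
\mathbb{E}_{Z\sim D^n}\hat{R}(\mathcal{A}'(Z), Z)
&= \mathbb{E}_{Z'\sim D^{\lfloor\sqrt{n}\rfloor}}\frac{\lfloor\sqrt{n}\rfloor}{n}\hat{R}(\mathcal{A}(Z'), Z')
 + \mathbb{E}_{Z'\sim D^{\lfloor\sqrt{n}\rfloor}}\mathbb{E}_{Z''\sim D^{n - \lfloor\sqrt{n}\rfloor}}\frac{n - \lfloor\sqrt{n}\rfloor}{n}\hat{R}(\mathcal{A}(Z'), Z'') \\
&= \mathbb{E}_{Z'\sim D^{\lfloor\sqrt{n}\rfloor}}\frac{\lfloor\sqrt{n}\rfloor}{n}\hat{R}(\mathcal{A}(Z'), Z')
 + \frac{n - \lfloor\sqrt{n}\rfloor}{n}\,\mathbb{E}_{Z'\sim D^{\lfloor\sqrt{n}\rfloor}}R(\mathcal{A}(Z')) \\
&\le \frac{1}{\sqrt{n}} + \xi(\sqrt{n}) + R^*. \qquad (14)
\end{aligned}
\]

The last inequality uses the boundedness of the loss function to get $\hat{R}(\mathcal{A}(Z'), Z') \le 1$ and
the $\xi$-consistency of $\mathcal{A}$ to bound the excess risk of $\mathbb{E}_{Z'}R(\mathcal{A}(Z'))$.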

To show (b), we need to exploit the assumption that the problem is (non-privately) learnable. 
By Shalev-Shwartz et al. (2010, Theorem 7), the problem being learnable implies that 
there exists a universally consistent algorithm B (not restricted to 0), that is universally 
AERM with rate 2>^'[n^) + and stable with rate Moreover, by Shalev-Shwartz et al. 
(2010, Theorem 8), B’s stability and AERM implies that B is also generalizing, with rate 
6^'(nJ) -|- Here the term “generalizing” means that the empirical risk is close to the 
population risk. Therefore, we can establish (b) via the following chain of approximations 

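\[
\mathbb{E}_{Z\sim\mathcal{D}^n}\hat{R}^*(Z)
\;\underset{\text{AERM of } \mathcal{B}}{\approx}\;
\mathbb{E}_{Z\sim\mathcal{D}^n}\hat{R}(\mathcal{B}(Z), Z)
\;\underset{\text{Generalization of } \mathcal{B}}{\approx}\;
\mathbb{E}_{Z\sim\mathcal{D}^n}R(\mathcal{B}(Z))
\;\underset{\text{Consistency of } \mathcal{B}}{\approx}\;
R^*.
\]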


More precisely. 


< 


Ez^v^R*{Z) - R* 

Ez^v^R*{Z)-Ez^v-R + Ez^v-R- R{B{Z),Z) + \R{B{Z), Z) - R* 


<[3e'(ni) + ^] + [6e'(ni) + ^] + [3e'(ni) + ^] = 12e(ni) + 

\/Tl v 71 A/ 77 A/ 71 


(15) 


Combine (14) and (15), we obtain the AERM of A' with rate 12^'(n^/'^) + ^ + as 

required. The privacy of A! follows from Lemma 34. ■ 


A.5 Proof for Theorem 19 

We first present the proof for Theorem 19. Recall that the roadmap of the proof is 
summarized in Figure 3. 

For readability, we denote e(n) by simply e. 
Recall that the objective function is $F(h, Z) = \frac{1}{n}\sum_{i=1}^{n}\ell(h, z_i) + g_n(h)$ and the corresponding
utility function is $q(h, Z) = -F(h, Z)$. By the boundedness assumption, it is easy to show
that if we replace one data point in any $Z$ with something else, then the sensitivity

\[
\Delta q = \sup_{h\in\mathcal{H},\,d(Z,Z')=1}|q(Z,h) - q(Z',h)| \le \frac{2}{n}. \qquad (16)
\]

Then by McSherry & Talwar (2007, Theorem 6), Algorithm 1, which outputs $h \in \mathcal{H}$ with
$\mathbb{P}(h) \propto \exp\big(\frac{\epsilon}{2\Delta q}\,q(h, Z)\big)$, naturally ensures $\epsilon$-differential privacy.
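To make the sampling rule concrete, here is a minimal sketch of the exponential mechanism on a finite grid of hypotheses, with output probabilities proportional to $\exp\big(\frac{\epsilon}{2\Delta q}q(h,Z)\big)$. The grid, the absolute loss $|h - z|$, and the absence of a regularizer are assumptions made for illustration only; they are not the setting of Algorithm 1, which samples from a general hypothesis space.

```python
import math


def exponential_mechanism_probs(utilities, eps: float, sensitivity: float):
    """Output distribution with P(h) proportional to exp(eps * q(h) / (2 * sensitivity))."""
    scale = eps / (2.0 * sensitivity)
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(scale * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def utility(h: float, data) -> float:
    """Negative empirical risk with the illustrative loss |h - z| for h, z in [0, 1]."""
    return -sum(abs(h - z) for z in data) / len(data)


if __name__ == "__main__":
    hypotheses = [i / 10 for i in range(11)]  # hypothetical finite hypothesis grid
    data = [0.2, 0.4, 0.9, 0.5]
    probs = exponential_mechanism_probs(
        [utility(h, data) for h in hypotheses], eps=1.0, sensitivity=2.0 / len(data)
    )
    print([round(p, 3) for p in probs])
```

Replacing one data point changes every utility by at most the sensitivity, so the probability of any output changes by a factor of at most $e^{\epsilon}$, which is the $\epsilon$-DP guarantee.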

Denote shorthand $F^* := \inf_{h\in\mathcal{H}} F(Z, h)$ and $q^* := -F^*$. We can state an analog of the
utility theorem of the exponential mechanism in McSherry & Talwar (2007).

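Lemma 36 (Utility). Assuming $\epsilon < \log n$ (otherwise the privacy protection is meaningless
anyway), if assumptions A1 and A2 hold for distribution $\mathcal{D}$, then

\[
\mathbb{E}_{Z\sim\mathcal{D}^n}\mathbb{E}_{h\sim\mathcal{A}(Z)}\,q(Z,h) \ge -\mathbb{E}_{Z\sim\mathcal{D}^n}F^* - \frac{9[(\rho + 2)\log n + \log K]}{n\epsilon}. \qquad (17)
\]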


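Proof. By the boundedness of $\ell$ and $g_n$,

\[
q(Z, h) = -\frac{1}{n}\sum_{i=1}^{n}\ell(h, z_i) - g_n(h) \ge -(1 + \zeta(n)).
\]

By Lemma 7 in McSherry & Talwar (2007) (translated to our case),

\[
\mathbb{P}_{h\sim\mathcal{A}(Z)}\big[q(Z, h) < -F^* - 2t\big] \le \frac{\exp\big(-\frac{\epsilon t}{2\Delta q}\big)}{\mu(S_t)}, \qquad (18)
\]

where $\mu(S_t)$ is the measure of the sublevel set $S_t$. Applying (16), taking expectation over the
data distribution on both sides, and applying assumption A2, we get

\[
\mathbb{E}_{Z\sim\mathcal{D}^n}\mathbb{P}_{h\sim\mathcal{A}(Z)}\big[q(Z, h) < -F^* - 2t\big] \le K t^{-\rho} e^{-\epsilon n t/4} = e^{-\gamma}. \qquad (19)
\]

Take $t = \frac{4[(\rho + 2)\log n + \log K]}{\epsilon n}$. By the assumption that $\epsilon < \log n$, we get $\log(nt) > 0$.
Substituting $t$ into the expression of $\gamma$, we obtain

\[
\gamma = \frac{\epsilon n}{4}t - \log K + \rho\log t = 2\log n + \rho\log(nt) \ge 2\log n,
\]

and therefore

\[
\mathbb{E}_{Z\sim\mathcal{D}^n}\mathbb{P}_{h\sim\mathcal{A}(Z)}\big[q(Z, h) < -F^* - 2t\big] \le n^{-2}.
\]

Denote $\mathbb{P}_{h\sim\mathcal{A}(Z)}\big[q(Z, h) < -F^* - 2t\big] =: p$. We can then bound the expectation from below
as follows:

\[
\begin{aligned}
\mathbb{E}_{Z\sim\mathcal{D}^n}\mathbb{E}_{h\sim\mathcal{A}(Z)}\,q(Z, h)
&\ge \mathbb{E}_{Z\sim\mathcal{D}^n}(-F^* - 2t)(1 - p) + \min_{h\in\mathcal{H},\,Z\in\mathcal{Z}^n} q(Z, h)\,\mathbb{E}_{Z\sim\mathcal{D}^n}\,p \\
&\ge \mathbb{E}_{Z\sim\mathcal{D}^n}(-F^* - 2t) + (-1 - \zeta(n))\,n^{-2} \\
&\ge -\mathbb{E}_{Z\sim\mathcal{D}^n}F^* - \frac{8[(\rho + 2)\log n + \log K]}{\epsilon n} - (1 + \zeta(n))\,n^{-2} \\
&\ge -\mathbb{E}_{Z\sim\mathcal{D}^n}F^* - \frac{9[(\rho + 2)\log n + \log K]}{\epsilon n}.
\end{aligned}
\]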


Now we can say something about the learning problem. In particular, the AERM follows 
directly from the utility result and stability follows from the definition of differential 
privacy. 

Lemma 37 (Universal AERM). Assume Al and A2, and e < logn (so Lemma 36 holds), 
then 


IEz~D" 


Eh^,^^z)kKZ)-R*{Z) 


9[(p + 2)logn + log(l/A)] 

-h C{n 

ne 


Proof. This is a simple consequence of boundedness and Lemma 36. 




Ef,^j^^Z)R{h,Z)-R*{Z) 


=Ezr^v^Ehr^MZ) - i{h, Zi) - inf - i{h, Zi) 


<Ez^v^Eii^_^i^z) 


— Ezr^v” inf 

h 


^£{h,Zi) +gn{h) 
i 

-'^^{h,Zi) +gn{h) 

71 < ^ 


- '^h^A{Z)9nih) 


+ sup{gnih)) 


=Ezr^v^i-F* -Ehr^_A^z)QiZ, h)) + supgn{h) -Ehr..A(z)gnih) 

h 


^ 9[(p + 2)logn + log(l/A)] ^ 

~ ne 

The last step applies Lemma 36 and $\sup_h |g_n(h)| \le \zeta(n)$ as in Assumption A2, using the
fact that $\sup_h g_n(h) - \mathbb{E}\,g_n(h) \le 2\sup_h |g_n(h)|$ for any distribution of $h$ over which the expectation is
taken. ■


The above lemma shows that Algorithm 1 is asymptotic ERM. By Lemma 8, the fact
that this algorithm is $\epsilon$-differentially private implies that it is $2\epsilon$-stable. Now the proof
follows by applying Theorem 35, which says that stability and AERM of an algorithm
certify its consistency. Noting that this holds for any distribution $\mathcal{D}$ completes our proof
of learnability in Theorem 19.


A.6 Proofs of other technical results 


High confidence private learning. 


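Proof of Theorem 25. The algorithm $\mathcal{A}$ privately learns the problem with rate $\xi(n)$, which implies
that

\[
\mathbb{E}_{Z\sim\mathcal{D}^n}\mathbb{E}_{h\sim\mathcal{A}(Z)}R(h) - R^* \le \xi(n).
\]

Let $h \sim \mathcal{A}(Z)$ and $Z \sim \mathcal{D}^n$. By Markov’s inequality, with probability at least $1 - 1/e$,

\[
R(h) - R^* \le e\,\xi(n).
\]

If we split the data randomly into $a + 1$ parts of size $n/(a + 1)$ and run $\mathcal{A}$ on the first $a$
partitions, then we get $h_j \sim \mathcal{A}(Z_j)$. Then with probability at least $1 - (1/e)^a$, at least one
of them has risk

\[
\min_{j\in[a]} R(h_j) - R^* \le e\,\xi\Big(\frac{n}{a + 1}\Big). \qquad (20)
\]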

Since the $(a + 1)$th partition consists of iid data and $\ell$ is bounded, we can apply Hoeffding’s
inequality and the union bound, so that with probability $1 - \delta_1$, for all $j = 1, \ldots, a$,

\[
\big|\hat{R}(h_j, Z_{a+1}) - R(h_j)\big| \le \sqrt{\frac{\log(2a/\delta_1)}{2n}}. \qquad (21)
\]


This means that if the exponential mechanism picked the one with the best validation risk, it
would be almost as good as the one with the best risk. Assume $h_1$ is the one that achieves
the best validation risk.

Now it remains to bound the probability that the exponential mechanism picks an $h \in \{h_1, \ldots, h_a\}$
that is much worse than $h_1$.

Recall that the utility function is the negative validation risk, which depends only on the
last partition $Z_{a+1}$:

\[
q(Z, h) = -\frac{1}{n/(a + 1)}\sum_{i\in I_{a+1}}\ell(h, z_i).
\]

This is in fact a random function of the data because we are picking the validation set
$I_{a+1}$ randomly from the data. Suppose we arbitrarily replace one data point $j$ from the
dataset. The distribution of the output of the function $q(Z, h)$ is a mixture of two cases:
$j \notin I_{a+1}$ and $j \in I_{a+1}$. Since in the first case $q(Z, h) = q(Z', h)$ for all $h$, the sensitivity for this
case is 0. In the second case, by the boundedness assumption, the sensitivity is at most
$2(a + 1)/n$. For the exponential mechanism to guarantee $\epsilon$-differential privacy, it suffices to
take the sensitivity parameter to be $2(a + 1)/n$.

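By the utility theorem of the exponential mechanism,

\[
\mathbb{P}\left[\hat{R}(h, Z_{a+1}) > \hat{R}(h_1, Z_{a+1}) + \frac{8(\eta\log n + \log a)}{\epsilon n/(a + 1)}\right] \le n^{-\eta}. \qquad (22)
\]

Combining (20), (21) and (22), we get

\[
\mathbb{P}\left[R(h) - R^* > e\,\xi\Big(\frac{n}{a + 1}\Big) + \sqrt{\frac{\log(2a/\delta_1)}{2n}} + \frac{8(\eta\log n + \log a)}{\epsilon n/(a + 1)}\right] \le n^{-\eta} + \delta_1 + e^{-a}.
\]

Now by appropriately choosing $\eta = \log(3/\delta)/\log n$, $a = \log(3/\delta)$, $\delta_1 = \delta/3$, we get

\[
\mathbb{P}\left[R(h) - R^* > e\,\xi\Big(\frac{n}{\log(3/\delta) + 1}\Big) + \sqrt{\frac{\log(2\log(3/\delta)) + \log(3/\delta)}{2n}} + \frac{8(\log(3/\delta) + \log\log(3/\delta))}{\epsilon n/(\log(3/\delta) + 1)}\right] \le \delta.
\]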


combine the terms and take e = ^e get the bound of the excess risk in the 

theorem. 
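The parameter choice $\eta = \log(3/\delta)/\log n$, $a = \log(3/\delta)$, $\delta_1 = \delta/3$ splits the failure probability into three equal parts. The sketch below checks this arithmetic; it treats $a$ as a real number and ignores the rounding needed to obtain an integer number of partitions, which is a simplification made here for illustration.

```python
import math


def failure_terms(n: int, delta: float):
    """The three failure probabilities n^{-eta}, delta_1 and e^{-a} under the stated choice."""
    eta = math.log(3.0 / delta) / math.log(n)
    a = math.log(3.0 / delta)
    delta_1 = delta / 3.0
    return n ** (-eta), delta_1, math.exp(-a)


def total_failure_probability(n: int, delta: float) -> float:
    """Sum of the three terms, which should equal delta."""
    return sum(failure_terms(n, delta))


if __name__ == "__main__":
    print(failure_terms(1000, 0.05))
    print(total_failure_probability(1000, 0.05))
```

Each term equals $\delta/3$, so the union bound over the three failure events gives total failure probability $\delta$.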

To get the privacy claim, note that we are applying A on disjoint partitions of the data so 
the privacy parameter does not aggregate. Take the worst over all partitions, we get the 


overall privacy loss max | e ^ 


log(3/n)+l 


iog(3/£)+i'l stated in the theorem. 
’ J 


The Lipschitz example. 

Proof of Example 21. Let $h^* \in \arg\min_{h\in\mathcal{H}} F(Z, h)$. The Lipschitz condition dictates that
for any $h$,

\[
|F(h) - F(h^*)| \le L\|h - h^*\|_p.
\]

Choose a small enough $t < t_0$ such that $h$ is in the small neighborhood of $h^*$. We can
construct a function $\tilde{F}$ within the sublevel set $S_t$ such that the above inequality (when
we replace $F$ with $\tilde{F}$) is an equality. Then for any $h \in S_{t_0}$, $\tilde{F}(h) \ge F(Z, h)$. Verify that the
sublevel set of $\tilde{F}(h)$, denoted by $\tilde{S}_t$, is always contained in $S_t$. In addition, we can compute the
measure $\mu(\tilde{S}_t)$ explicitly, since the function is a cone and

\[
L\|h - h^*\|_p = |\tilde{F}(h) - \tilde{F}(h^*)| = \tilde{F}(h) - \tilde{F}(h^*) \le t,
\]

therefore

\[
\tilde{S}_t = \{h \mid L\|h - h^*\|_p \le t\}.
\]

Since Ti is /3p-regular, n > (5p^{B) for any Ip ball B C the measure of the 
sublevel set can be lower bounded by f3p times the volume of the Ip ball with radius t/L 
and since St ^ St, we have 

l^{St) > {B{t/L)) = /3p [t/lf 


as required. 


B Alternative proof of Corollary 9 via Dwork et al. (2015b, 
Theorem 7) 


In this Appendix, we describe how the results in Dwork et al. (2015b) can be used to obtain 
the forward direction of our characterization without going through a stability argument. 
We first restate the result here in our notation: 

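Lemma 38 (Theorem 7 in Dwork et al. 2015b). Let $\mathcal{B}$ be an $\epsilon$-DP algorithm such that
given a dataset $Z$, $\mathcal{B}$ outputs a function from $\mathcal{Z}$ to $[0,1]$. For any distribution $\mathcal{D}$ over
$\mathcal{Z}$ and random variable $Z \sim \mathcal{D}^n$, we let $\phi \sim \mathcal{B}(Z)$. Then for any $\beta > 0$, $\tau > 0$ and
$n \ge 12\log(4/\beta)/\tau^2$, setting $\epsilon \le \tau/2$ ensures

\[
\mathbb{P}_{\phi\sim\mathcal{B}(Z),\,Z\sim\mathcal{D}^n}\left[\,\Big|\mathbb{E}_{z\sim\mathcal{D}}\,\phi(z) - \frac{1}{n}\sum_{z\in Z}\phi(z)\Big| > \tau\right] \le \beta.
\]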


This lemma was originally stated to prove the claim that privately generated mechanisms 
for answering statistical queries always generalize. 
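The conditions of Lemma 38 translate into a simple requirement on the sample size and the privacy parameter. The sketch below computes the smallest integer $n$ satisfying $n \ge 12\log(4/\beta)/\tau^2$ and the largest admissible $\epsilon = \tau/2$; the function names are illustrative and not from the cited work.

```python
import math


def min_sample_size(beta: float, tau: float) -> int:
    """Smallest integer n with n >= 12 log(4 / beta) / tau^2."""
    return math.ceil(12.0 * math.log(4.0 / beta) / tau ** 2)


def max_epsilon(tau: float) -> float:
    """Largest privacy parameter allowed by the lemma for accuracy tau."""
    return tau / 2.0


if __name__ == "__main__":
    beta, tau = 0.05, 0.1
    print(min_sample_size(beta, tau), max_epsilon(tau))
```

The dependence on $\tau$ is quadratic, so halving the target accuracy roughly quadruples the required sample size while also halving the admissible $\epsilon$.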

For statistical learning problems, we can simply take the statistical query $\phi$ to be the
loss function $\ell(h, \cdot)$ parameterized by $h \in \mathcal{H}$. If an algorithm $\mathcal{A}$ that samples from a
distribution on $\mathcal{H}$ upon observing data $Z$ is $\epsilon$-DP, then $\mathcal{B}: Z \mapsto \ell(\mathcal{A}(Z), \cdot)$ is also $\epsilon$-DP.
The result therefore reduces to the statement that the empirical risk and population risk are close with
high probability. Due to the boundedness assumption, we can translate the high probability
result to the expectation form, which verifies the definition of “generalization”.

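However, “generalization” alone still does not imply “consistency”, as we also need

\[
\mathbb{E}_{\phi\sim\mathcal{B}(Z)}\frac{1}{n}\sum_{z\in Z}\phi(z) \to R^* = \inf_{\phi}\mathbb{E}_{z\sim\mathcal{D}}\,\phi(z)
\]

as $|Z|$ gets large, which does not hold for all DP outputs $\phi$. But when $\phi = \ell(h, \cdot)$, it can be
obtained if we assume $\mathcal{A}$ is AERM. This is shown via the following inequality:

\[
\mathbb{E}_{\phi\sim\mathcal{B}(Z)}\frac{1}{n}\sum_{z\in Z}\phi(z) \to \mathbb{E}_{Z\sim\mathcal{D}^n}\min_{\phi}\frac{1}{n}\sum_{z\in Z}\phi(z) \le \mathbb{E}_{Z\sim\mathcal{D}^n}\frac{1}{n}\sum_{z\in Z}\phi^*(z) = \mathbb{E}_{z\sim\mathcal{D}}\,\phi^*(z) = R^*,
\]

where $\phi^* = \ell(h^*, \cdot)$ and $h^*$ is an optimal hypothesis function. This wraps up the proof of
consistency.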

The above proof of “consistency” via Lemma 38 and “AERM”, however, leads to a looser
bound compared with our result (Corollary 9) when the additional assumption on $n$ and
r (equivalently e) is active, i.e., when < O this case it only implies a 

^(n) -|- bound due to that e-DP implies e'-DP for any e' > e. Our proof of Corollary 9 
is considerably simpler and more general in that it does not require any assumption on the 
number of data points n. 

This can easily lead to worse overall error bound for very simple learning problems with 
sufficiently fast rate. For example, in the problem of learning the mean of A G [0,1], let the 
loss function be \x—h\^^. Consider the e(n)-DP algorithm that outputs ERM-|-Laplace( 

where e(n) is chosen to be This algorithm is AERM with rate ^(ra) = = 

0(n~^). By Corollary 9 we get an overall rate of while through Lemma 38 and 

the argument that follows, we only get 0(n“^/^). 